METR's Frontier Risk Report: AI models cheat 1-in-6 times on long tasks, and Anthropic's monitoring was jailbroken by an embedded auditor

May 19, 2026 · Full transcript · This transcript is auto-generated and may contain errors.

Speaker 2: Great to see you.

Speaker 1: You good? Thank you. Congrats. Have a great rest of day. To you too.

Speaker 5: We'll talk to you.

Speaker 1: Appreciate it.

Speaker 2: Goodbye. Next Lion.

Speaker 1: Ajeya Cotra from METR is joining the show to talk about their new Frontier Risk Report, which came out today. How are you doing?

Speaker 8: Good. Thanks for having me on. Great to be here.

Speaker 1: Thanks for coming on. Why don't you start with a little bit of your background, maybe an introduction on how you fit into METR as an organization, and maybe even just reset with an introduction of METR: what the purpose of the firm is, the structure of the firm.

Speaker 8: Yeah. So my name is Ajeya. I actually joined METR pretty recently, in January, to lead the writing of this Frontier Risk Report. Before that, I'd spent about a decade in AI safety in a couple of different capacities, all at Coefficient Giving, which is a big funder of AI safety work. A lot of my work had been kind of bigger picture

Speaker 1: k.

Speaker 8: Forecasting, longer term, like, when are we gonna get super powerful AI? What's gonna happen with the world? What kind of risks might it pose? And at METR, I really like that the mission is to kind of take that stuff seriously, but then try to make it measurable. Like, try to make risks from misaligned AI something that we can track, and do the best possible job as a civilization of getting on the same page about them.

Speaker 1: You know?

Speaker 8: So I see that as having two parts. One is developing the measurement tools, so the telescopes and the microscopes and the instruments we need to understand what these systems' capabilities are, what their motivations or inclinations are, what incidents we've seen of things going wrong, and where that is all heading with the trends.

Speaker 1: Yeah.

Speaker 8: And then the other side of that is to actually apply that to real frontier deployments and try to understand the risks posed by a particular system in partnership with companies.

Speaker 1: Yeah.

Speaker 8: And the Frontier Risk Report is sort of that half of it, where METR, for the first time

Speaker 6: I about

Speaker 1: to

Speaker 8: has done a sort of cohort thing with a bunch of different companies, working with Google, OpenAI, Meta, and Anthropic, where they gave us access to their best internal models, sort of on our terms, and answered a long questionnaire we sent them about, you know, how they align these systems, what incidents they saw with them, and how they use them

Speaker 1: Yeah.

Speaker 8: So that we could kind of pull together almost like a state of the union of, like, what's the deal with misalignment risk inside these companies.

Speaker 1: Yeah. And so how are you trying to quantify the actual findings? Is it like a number of incidents or magnitude of incidents? It feels like it can be very abstract, but the whole purpose of METR is to sort of quantify, narrow down, contextualize. And so what were the goals, or did the goals change, you know, after you actually get access to the models, you get these questionnaires back, you see the internal reasoning? Are you starting to construct benchmarks around those? Or is it important that you come in with your sort of metrics pre-baked so that the access doesn't change what you're measuring?

Speaker 8: Yeah. That's a good question. And it's definitely a mix. I think we had, I would say, basically three big goals. The first one was really to just do a dry run of a process for what good auditing of risks could look like. So most third party evaluators, including METR in the past

Speaker 1: Yeah.

Speaker 8: You know, a company is about to release a model in two weeks. And they call you up and they say, can you run some evals on this model? You kind of scramble to do two or three evals. They put out the model. They put your evals in the system card. And we wanted to do something that was both deeper and kind of driven by us, as opposed to tied to launch schedules.

Speaker 1: Yeah. And so really quickly going back to, like, evaluating the older models, what does that actually look like in practice? Is that like, you know, give me instructions for how to build a bioweapon, and that's just the prompt, and then you're just seeing if it rejects that properly? Like, what are some examples of evaluations that you would do prior?

Speaker 8: So, yeah. So you're talking about red teaming

Speaker 1: Yep.

Speaker 8: Which the UK AI Security Institute does a lot of, where, yeah, the company will be like, will this model tell you how to make a bioweapon?

Speaker 1: Yep.

Speaker 8: You you have a week or two. You try a bunch of jailbreaks.

Speaker 1: Yep.

Speaker 8: You generally just get output access to the model. So you can't necessarily go super deep.

Speaker 1: Yep.

Speaker 8: And what METR used to do is dangerous capability evaluation. So it's not even the jailbreaking piece. It's purely just what can this model do autonomously, on its own. So we're best known for our time horizon chart, which is plotting models with the x axis being their release date and the y axis being how complex of a task they can do by themselves, measured by how long it would take a human to do the task. So we released this in spring 2025. Models had, like, a time horizon of less than an hour. And now the best models have a time horizon of more than two full-time-equivalent days.
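
As a rough illustration of the time horizon figures above, the following Python sketch computes the growth implied by going from about one hour to about two working days. It is a back-of-the-envelope calculation, not METR's methodology: the 8-hour workday, the 14-month gap (spring 2025 to this May 2026 conversation), and the function names are all assumptions made for illustration. Because the starting figure is an upper bound ("less than an hour") and the ending figure is a lower bound ("more than two days"), the result is a floor on the number of doublings.

```python
import math

# Assumption: one full-time-equivalent day is 8 working hours.
HOURS_PER_WORKDAY = 8


def implied_doublings(start_hours: float, end_hours: float) -> float:
    """Number of times the time horizon doubled between two measurements."""
    return math.log2(end_hours / start_hours)


def implied_doubling_time_months(
    start_hours: float, end_hours: float, elapsed_months: float
) -> float:
    """Average months per doubling over the elapsed period."""
    return elapsed_months / implied_doublings(start_hours, end_hours)


# "Less than an hour" in spring 2025: use 1 hour as an upper bound.
start_hours = 1.0
# "More than two full-time-equivalent days" now: use 2 days as a lower bound.
end_hours = 2 * HOURS_PER_WORKDAY
# Assumption: roughly 14 months between the two measurements.
elapsed_months = 14

doublings = implied_doublings(start_hours, end_hours)
doubling_time = implied_doubling_time_months(start_hours, end_hours, elapsed_months)

print(f"Doublings (at least): {doublings:.1f}")
print(f"Months per doubling (at most): {doubling_time:.1f}")
```

Under these assumptions the sketch gives at least 4 doublings, or a doubling time of at most about 3.5 months. Changing `elapsed_months` or `HOURS_PER_WORKDAY` shifts the answer, so the number is only as good as those guesses.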

Speaker 1: Yeah.

Speaker 8: So, you know, a lot of the time they can do software tasks that a human would take days to do. So that was our lane. It's like capability evaluations. With this report, we're trying to expand into two different verticals at the same time as we're kind of expanding into deeper access. We're calling it means, motive, and opportunity. So means is the capability piece of it, which METR has the longest history with. Motive is understanding, based on how these systems are trained and based on what we've seen of things that can go wrong in real deployments, what are their tendencies? Like, under what circumstances would they misbehave?

Speaker 1: Mhmm.

Speaker 8: And can we get better at predicting that? And then opportunity is the whole system surrounding the agent in terms of what are the operating conditions? How are they used? How are they overseen? Are they subject to monitoring? Are they subject to security? And therefore, like, could they get away with certain harmful actions or would they be stopped?

Speaker 1: Mhmm. And, I mean, I'm interested in more of, like, the actual findings, like the state of the union on what the capabilities are and where we are on actually mitigating misalignment. So let's talk about that, and then I want to know downstream where all this goes and where you'd like to see standards sort of emerge.

Speaker 8: Yeah. And so that kind of goes back to your question of, you know, did you kind of come in with the framework all baked? Or did you kind of discover it as you did the report? And I think it's very much the latter. We knew what types of information we wanted to gather. We knew we'd want to know about incidents and how they train the system. We kind of prepped this whole questionnaire before the process even started. But then as we were writing the report, this framework emerged of basically a two dimensional scale of AI misalignment incidents, where one dimension is what we're calling overreach, which is how far the AI blew past the bounds of where it was supposed to stay. So we have three buckets of this. One is it just violates user instructions and goes and, like, does something it's not supposed to do, but there was no actual, like, hard barrier that it had to hack through or anything like that. So an example of this is in one of our tasks, Opus 4.6 ran out of API credits in the account we gave it to do a task. So it just, like, went and found free compute online, against explicit task instructions.

Speaker 1: Yeah.

Speaker 8: But we didn't, like, have a security barrier. It just kind of, like, went on the Internet and found something and set it up. And the next level of overreach is when an agent actually hacks past something

Speaker 1: Yeah.

Speaker 8: Like an actual security perimeter.

Speaker 1: Yeah.

Speaker 2: Yeah.

Speaker 8: And we find that on some of our tasks, agents are constantly trying to, like, break out of their sandbox and find the file where we, like, put the test so they can get the answer key.

Speaker 1: Yeah.

Speaker 8: So we have some of the hardest evaluations around. Most people evaluate models on, like, pretty short tasks that are pretty easy for them. And we have tasks that are, you know, eight, ten, twenty hours long. And on tasks longer than eight hours, models cheat more than one time in six. So imagine an employee that, you know, one time in six just, like, flagrantly tries to steal from you.

Speaker 1: People take the shortcuts on the longest path. Yeah. They don't bother to take shortcuts if they're just going to block.

Speaker 8: Yeah. Yeah. And so on our shorter tasks that are like thirty minutes, we find the cheating rate is half a percent, which is similar to what companies report in their system cards. But on these longer tasks, it's one in six. And on some distributions it's higher. We have this dataset called MirrorCode, which is basically having AI systems reimplement big pieces of software. And Opus 4.6 on hard tasks in MirrorCode attempts to cheat 80% of the time. So they're just desperate. They know that the test cases are there. They want to overfit.
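
To put the quoted cheating rates side by side, here is a small Python sketch. The three rates come from the conversation (half a percent on roughly thirty-minute tasks, one in six on tasks longer than eight hours, 80% on the hard software-reimplementation tasks). Everything else is an assumption for illustration: the variable and function names are hypothetical, one in six is treated as a lower bound because the figure quoted is "more than one in six", and runs are assumed independent, which the conversation does not claim.

```python
from fractions import Fraction

# Rates quoted in the conversation, stored as exact fractions.
short_task_rate = Fraction(1, 200)     # 0.5% on ~30-minute tasks
long_task_rate = Fraction(1, 6)        # lower bound on tasks longer than 8 hours
hard_reimpl_rate = Fraction(4, 5)      # 80% on hard reimplementation tasks


def relative_rate(rate: Fraction, baseline: Fraction) -> Fraction:
    """How many times larger `rate` is than `baseline`."""
    return rate / baseline


def prob_at_least_one(rate: Fraction, n_runs: int) -> Fraction:
    """Chance of at least one cheating attempt in n_runs runs.

    Assumes each run cheats independently with the same probability.
    """
    return 1 - (1 - rate) ** n_runs


long_vs_short = relative_rate(long_task_rate, short_task_rate)
hard_vs_short = relative_rate(hard_reimpl_rate, short_task_rate)
six_long_runs = prob_at_least_one(long_task_rate, 6)

print(f"Long tasks vs. short tasks: about {float(long_vs_short):.0f}x")
print(f"Hard reimplementation vs. short tasks: {float(hard_vs_short):.0f}x")
print(f"At least one attempt in 6 long runs: {float(six_long_runs):.1%}")
```

Under the independence assumption, the jump from short to long tasks is at least a factor of about 33, and six long-task runs would include at least one cheating attempt roughly two times in three.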

Speaker 1: I think I'm thinking of a different benchmark, but Meta put out what sounded like a somewhat similar benchmark of, like, rebuild a full complex software repo. And I think all of the models were at like half a percent, basically back at zero again, sort of like an ARC-AGI v3 or some of the METR tasks that you have that are not passing at all. And for that, even intuitively, I'm like, I would just clone the repo and start there. But of course that's cheating. And so it's very intuitive that if your boss comes to you and says, I need you to rebuild Chrome, you'd be like, okay, well, I'm starting with Chromium and then I'll add some features. Like, this is a very logical path, so I sort of empathize with the models that they cheat in this way. But let's move on to, like, where this goes, because, I mean, you've seen, you know, Eric Schmidt getting booed offstage for talking about AI. There's a lot of AI anxiety. Data centers are being opposed. There's a bunch of calls for an AI FDA or something of that sort. I think a lot of the model providers, maybe not all the ones that you've worked with, have signed on to let the government review the models.

Speaker 4: Like, where

Speaker 1: do you think this goes? Do you want this to remain in the private sector, be formalized further, build METR as an international organization? Where is the energy going? Where is there demand from the folks that you talk to?

Speaker 8: Yeah. So, METR is very interested in, and our partner companies are interested in, setting up basically a sensible auditing regime that is technically literate

Speaker 1: Sure.

Speaker 8: For these catastrophic risks. So, you know, you don't want, like, a box-checking auditor that has, like, sort of 17 arbitrary things you're supposed to do

Speaker 5: The model

Speaker 1: is gonna find those boxes and check them. Yeah. The model is gonna find that auditor, hack into their checklist, and check everything. We know what happens here.

Speaker 8: Yeah. So it's like we're in this, like, weird situation where the science is, like, extremely nascent and fast moving, but then also the risks might be kind of imminent. So we need a flexible system. And my best guess is that it's gonna look something like what happens in the financial sector in some cases, where you have embedded auditors. So you have other folks who are experts in finance who, you know, sit and eat lunch with the employees and see all the books and know everything and have a lot of flexibility to investigate what they need to investigate. And we actually released details on an embedded auditing exercise we did with Anthropic as part of this report, where a METR employee went in for three weeks and just tried to break Anthropic's monitoring system. So he just sort of played the role of a rogue AI and tried to wreak havoc and tried to break things. And he found several ways to jailbreak and disable and evade the monitors. And that's not something you can get just from, you know, sending out a form and having them fill it out.

Speaker 1: So Yeah.

Speaker 8: Yeah. We're really hoping to move the work more and more in the, like, embedded direction. So embedded auditing of the monitoring system

Speaker 4: Sure.

Speaker 8: Like we did with Anthropic, potentially even embedded auditing of training. So, like, getting samples of what the system was trained on, analyzing the training incentives that might have been created, trying to figure out if the training data could have been poisoned even.

Speaker 2: Yeah. Does this, you know, when when you say auditor, I think, you know, potentially like for profit business, would there be a possibility that

Speaker 1: yeah. Is like not a joke. All the financial auditor companies are huge.

Speaker 2: Yeah. So is there a possibility that

Speaker 8: that the way.

Speaker 2: Is there is

Speaker 1: there But maybe it makes sense. Maybe it's actually a better Yeah.

Speaker 2: I'm saying, is there a possibility in the future where METR has a for-profit, you know, auditing arm that maybe you guys spin out?

Speaker 8: So I don't know what the future might hold, but METR does not take money for our engagements with companies, and that's very important to us because we want to have our scientific independence. Although, you're right.

Speaker 1: A in the regime PricewaterhouseCoopers is like a successful Yeah. I'm just

Speaker 2: saying like in

Speaker 1: a If if you

Speaker 2: want auditors that are technically competent, that have been working with the models for a really long time, there's not a lot of organizations outside of METR that would be qualified to do this kind of work. You might be

Speaker 1: It's the final alignment problem for you. Good luck. You have to see.

Speaker 8: You might wanna maybe, like, split the auditing from the scientific judgment. One thing I like from the nuclear space is that the nuclear power plants actually rate each other's safety.

Speaker 1: Oh, yeah.

Speaker 8: Which is, like, interesting. I could imagine METR kind of, like, digging up information, and then, like, OpenAI rates Anthropic, and Anthropic rates OpenAI and GDM.

Speaker 2: I'm sure I'm sure everyone Yeah. Will be able to do

Speaker 1: Yeah. Yeah. Much more drama. Just fired shots. It's over. I'm sure the post will go viral every time. Well, thank you so much for coming on. You can go find the report on METR's X account, @METR_Evals is the account, and metr.org is the website. Thank you so much for coming on the show. We'll talk to you soon.

Speaker 8: Yeah. Thank you so much.

Speaker 1: Bye bye. Have a good one. Goodbye. Our next guest is live with us in person. We have one post we need to pull up first. There's some news from Micron Technologies. The stock's been on an absolute run, but recently it traded down eight and a half percent. It's just $664 a share and talent chimes in and says, I knew this was going to happen. That's why I sold at $120. Very silly.
