ARC Prize launches ARC-AGI v3: a game-based agentic benchmark where AI scores less than 1% vs. humans at 100%
Mar 25, 2026 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Mike Knoop
Thanks, Jordy. Up next, we have Mike from ARC Prize in the Restream waiting room. Let's bring him into the TBPN Ultradome. I'm very excited to talk to Mike. How are you doing?
Here we go. Hey. Hey. Good to see you again.
Good to see you. Uh
I'm so excited. This is always the highlight of the show. I love talking to you about everything. But what are we talking about today? Reintroduce ARC as an organization, and then take us through the actual... do you even call them benchmarks? Challenges? What's the right term?
I think benchmarks is a fair word.
Okay.
Yeah, almost three years ago now, I co-founded the ARC Prize Foundation, me and François Chollet.
The ARC Prize Foundation has a mission to be the north star of AGI. We have two jobs. One is to be a useful public sense-finding tool, for the public to understand how close or how far we are towards AGI. Yeah.
And the second is to inspire progress towards AGI. ARC is a series of benchmarks that help highlight some of the large remaining gaps between what frontier AI is capable of and what humans are capable of. We target that gap; that is ultimately our definition of AGI. We produce an ongoing series of benchmarks, continually studying frontier progress, and at some point we're not going to be able to do that job anymore. We'll run out of ideas; we'll test the frontier and say we can't find any more gaps, and I think that will be the moment when it becomes commonly accepted to say, okay, we've got AGI now. And today we are announcing and launching the newest and best version of ARC: ARC-AGI-3. It's the latest in the series, and it's a really large format change from the first two. ARC-AGI-3 is designed to test agentic intelligence, and as far as I'm aware, and I've been interviewing folks all over the AI scene in the last few weeks, it is the only unsaturated general AI agent benchmark in the world. The headline score: humans score 100%, AI scores less than 1%.
Okay. Unpack that launch, because my conception of ARC-AGI v3 is that it's almost like a 2D game. It's no longer the puzzles where I'm picking colors to match a pattern. It's actually moving with the arrow keys, stepping on triggers, opening doors, flipping switches, that type of thing. And I played it with you on the stream months ago.
Yeah, you helped us launch our preview several months back.
Okay. So, that was the preview.
Today, we've got the full data set launching.
So does that mean more, I'm going to call them games, more actual levels launching? Or is what you're launching that you ran the actual benchmark and got the four leading labs to devote the compute and actually open up their models to interface with the system to get the scores?
Both things, actually. So today, on the benchmark side, the public version of the benchmark, or I guess the overall benchmark, is over 100 games, nearly a thousand different levels across these game-like environments. I think it's fair to call them games; we've designed them to be fun, and games are fun. From a research standpoint, though, I think you could look at them more as environments. These environments are intended to test whether AI can effectively explore, discover its own goals, acquire strategy, develop plans, and execute those plans.
One of the really unique things about ARC-AGI-3 compared to the one and two format is that it is interactive now. Whereas, as you mentioned, one and two look like these kind of static IQ puzzles on a page.
Three challenges both humans and AI to figure out the goals themselves. When you're dropped into one of these environments, your only explicit goal is to win. And so in order to figure out how to do that, you have to actually dedicate some exploration to figuring out the rules, the mechanics, the strategy. And one important thing: as you're playing these environments, the strategy and mechanics grow and evolve and change over time. This is one of the reasons I think ARC-AGI-3 will be a really useful tool for understanding agentic intelligence this year. I think it'll be our first real test for seeing early progress on AI systems that are able to do on-the-fly world modeling and some degree of continual learning. These are both critical capabilities that we view as missing today, and that ARC-AGI-3 tests for.
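As a toy illustration of what on-the-fly world modeling can look like in an interactive environment like this, here is a minimal sketch. The reset/step interface and the random exploration policy are hypothetical stand-ins, not the actual ARC-AGI-3 API or a recommended agent design; the point is only that the agent starts with no knowledge of the rules and builds up a transition model as it plays.

```python
# Hypothetical sketch: an agent with no prior knowledge of the game records
# what each action did in each state, building a crude world model on the fly.
import random
from collections import defaultdict

def learn_world_model(env, actions: list[str], steps: int = 500) -> dict:
    """Explore with random actions, recording observed (state, action) -> next state."""
    model: dict[tuple[str, str], str] = defaultdict(str)
    state = env.reset()                          # hypothetical env API
    for _ in range(steps):
        action = random.choice(actions)          # naive exploration policy
        next_state, done = env.step(action)      # advance the environment one frame
        model[(state, action)] = next_state      # update the learned transition table
        state = env.reset() if done else next_state
    return dict(model)
```

A real agent would of course need to plan against this model and revise it as the mechanics evolve mid-game, which is exactly the part the benchmark is probing.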
Okay. Take me a little bit deeper: you said there's a thousand games, or a thousand levels?
I think it's a few over a hundred games. Okay. Across those hundred-plus environments, nearly a thousand levels across all of them.
Yeah, it's a much larger version of the benchmark than we've ever had previously. And then, like I mentioned before, the other major thing we're announcing today is frontier scores. So the benchmark is launching, and we're also publishing, as of today, the latest four models across all four major labs. And yeah, I think SOTA is currently sitting at, like, 0.4%.
Gemini 3.1 Pro. Maybe there's an extra hyphen in there, but basically Gemini, Anthropic, and OpenAI were all at 0.2, 0.3-something, and then Grok was, I think, at 0%. Walk me through the actual buildout of these hundred games. Is this entirely human-done? Is there some sort of computer-aided tooling to insert variation programmatically, or is it important that they're all created by hand? How do you think about the creation?
I wish we could use AI to help design games; we'd be able to make the benchmark even bigger and better. The reality is that humans are still the bottleneck on creativity, so every game has been handcrafted and hand-designed by humans. You could imagine embedding all these different levels on a big manifold: you want them all as far apart as possible in that embedding space.
And today, humans are still the limiting factor in ensuring that every game is different and as novel from the others as possible.
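As a rough illustration of that "as far apart as possible on a manifold" intuition, here is a minimal sketch using farthest-point sampling over hypothetical game embeddings. This is not a tool ARC Prize has described using (the games are handmade); the embeddings and the selection heuristic are assumptions for illustration only.

```python
# Hypothetical sketch: pick a subset of games whose embeddings are mutually far apart.
import numpy as np

def pairwise_cosine_distance(embeddings: np.ndarray) -> np.ndarray:
    """Return an (n, n) matrix of cosine distances between game embeddings."""
    normed = embeddings / np.linalg.norm(embeddings, axis=1, keepdims=True)
    return 1.0 - normed @ normed.T

def greedy_max_spread(embeddings: np.ndarray, k: int) -> list[int]:
    """Greedily pick k games that are mutually far apart (farthest-point sampling)."""
    dist = pairwise_cosine_distance(embeddings)
    chosen = [0]  # start from an arbitrary game
    while len(chosen) < k:
        # pick the candidate whose nearest already-chosen neighbor is farthest away
        nearest = dist[:, chosen].min(axis=1)
        nearest[chosen] = -1.0  # exclude games already in the set
        chosen.append(int(nearest.argmax()))
    return chosen

# Usage with random vectors standing in for real game embeddings:
rng = np.random.default_rng(0)
fake_embeddings = rng.normal(size=(120, 64))
print(greedy_max_spread(fake_embeddings, k=10))
```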
Yeah.
There are a few interesting design changes, actually, from a benchmark standpoint compared to one and two. Maybe the largest: ARC studies frontier progress, so we have to design future versions of the benchmark to adapt to changing frontier progress. One of the design goals with one and two was what's called a private and a public test split: we have a public version of the benchmark and a private holdout version, which is what we actually use to verify the performance of frontier models.
So the frontier models get no freebies. They don't get anything from the public set, but they can try.
You can't memorize the public set, right?
Or they can experiment on the public set from, like, a prompting perspective, maybe.
Yeah. The idea is the public set is intended to demonstrate the format. Okay.
And this was similar with ARC 1 and 2. However, we held a design goal that the public sets and the private sets were IID with each other; basically, they're supposed to be as close as possible to each other, and the split is just along visibility: some are private and some are public.
Sure.

With one of the big advancements in AI, reasoning, this is actually not a very useful way to run benchmarks anymore. AI reasoning systems are so powerful now that they can actually generalize across IID test splits, and this is what we saw with ARC 1 and 2. So with 3, one of the big design decisions is that we're releasing fewer games into the public demonstration set. There are only about 25 games in the public set. We're explicitly not even calling it a training set anymore; we're calling it a demonstration set, just to show the format to humans, and so you can test your systems, make sure you can run them, get a feel for them. There's obviously fun marketing value in being able to play the games as humans too, which we really love. And the private set, this is the set with over 100 games, is specifically different. We designed those games with different characteristics, different goals, different intelligence capabilities required to beat them. The difficulty, the acceptance criteria, is more extreme between human and AI performance, all to hopefully produce the most useful, high-signal benchmark for whether we're actually getting real progress towards AGI with the foundation models.
So let me pitch you a strategy. If I have access, and, you know, I'm at Google or OpenAI or Anthropic and I want to do well here, can I take the public set and create a log of all the steps and all the reasoning chains and all the keystrokes that are required to pass those levels, and then sort of dump that into the context window before I go off into the unknown...
And train your model that way, basically.
Maybe train my model, but I'm also just wondering if that's helpful for setting up the context, or for doing some sort of pre-compaction of the strategies that are learned. Maybe not even training a custom model, because I feel like that would be bench-hacking. I'm more thinking about just: okay, we went and played all the public games to completion, we monitored them, screen-recorded them, tried to extract as many learnings as possible into, you know, an MD file, basically, and then we include that in the prompt that kicks us off, to sort of bootstrap the learning once we get into the unknown environment.

If we've done a good job on the benchmark, you should not be able to train a system on the public set and perform well on the private set. If we've done a good job. Obviously every benchmark release is an experiment. Yeah. Right. We make contact with reality. We ship these benchmarks publicly, we try to analyze the performance, understand what they're good at and bad at, and evolve future versions of the benchmark. But intentionally, and this is actually very closely related to another design decision that we're making with our scoring function going forward this year, again in response to the AI progress we've seen: our scoring methodology is basically AGI-pilled at this point. Going forward with V3 we're using, I kind of have this idea of, basically a philosophy of having essentially no harness.
We want to create a testing experience that's as similar as possible between the human and the AI test takers. For our human baseline, we literally rented a testing center in San Francisco and had hundreds of humans play these games. Yeah. All they're given is sensory input through their eyes and motor action output through their hands back into our testing interface, and all of the intelligence happens between those two steps. So we try to emulate that as closely as possible for our verification function, where we have this philosophy of a very stateless client, so that our scoring function basically tries not to introduce any kind of bias, any kind of help, any kind of potential cheating strategy. If you go read our prompt, it's extremely simple. It's like: you are playing a game, here are your actions, your conversation will be carried forward to the next turn, and that's it. In order to, again, produce this really clear signal: when there's real progress in the base intelligence layer of AI, we're able to detect it.
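To make the "no harness, stateless client" idea concrete, here is a minimal sketch of what such a scoring loop could look like. The environment interface (reset/act), the call_model function, and the prompt wording are hypothetical stand-ins, not ARC Prize's actual verification code or exact prompt; the point is that the scaffold contributes nothing beyond carrying the conversation forward.

```python
# Hypothetical sketch of a bare, stateless scoring client: no hints, no search,
# no memory tricks; the model is the only source of strategy.
SYSTEM_PROMPT = (
    "You are playing a game. On each turn you will see the current frame and the "
    "available actions. Reply with exactly one action. Your conversation will be "
    "carried forward to the next turn."
)

def play_one_game(env, call_model, max_turns: int = 200) -> bool:
    """Drive a single game with a plain conversation loop; return True on a win."""
    conversation = [{"role": "system", "content": SYSTEM_PROMPT}]
    observation = env.reset()                      # hypothetical env API
    for _ in range(max_turns):
        conversation.append({"role": "user", "content": observation})
        action = call_model(conversation)          # model chooses the next action
        conversation.append({"role": "assistant", "content": action})
        observation, done, won = env.act(action)   # environment advances one turn
        if done:
            return won
    return False
```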
Okay. So take me back through history a little bit, because I'm surprised about why AI is struggling with this in particular. I remember, it feels like almost a decade ago, that OpenAI had a product, I think it was called Gym, and they were able to beat Mario, and then OpenAI Five beat a professional Dota 2 team, and they were able to do things that I can't do. I certainly can't beat Lee Sedol at Go. I certainly can't win Jeopardy or any of these things. And yet AI systems were able to dominate those games. You've created new games. What's different about the games, or about the strategies by the AI labs, such that we're not matching up like we did in the past?
I think the biggest thing is the expectation of what constitutes real progress towards AGI, right? When labs were using games, in maybe the 2016 to 2019 era when they were very popular, human researchers were studying the games, trying to understand the failure modes of machine learning and deep learning, trying to build custom search-like harnesses and feedback mechanisms from the environments. It was very, very handcrafted; it was loaded with what I'll call human glue right in the research process.
We are now at a point where we want to control for that, actually. We want as little human as possible in these systems, right? We want to understand: can AI basically do what the human researchers were doing back in that era, in order to beat games it has never been trained on or exposed to before? Interesting. So I do think it's kind of elegant that we're coming full circle, where games are these very minimal representations of actually important capabilities that humans possess: exploring, developing strategy, world modeling, being able to learn on the fly. They're really elegant as far as environments go. But I think what's changed is our expectation of how much human crafting is needed in order to learn the games when they haven't been specifically trained on, and that's the big difference today, especially with ARC-AGI-3.
Okay. Remind me of some more history, but more related to ARC. I remember with one of the ARC-AGI benchmark tests, there was a version of a model from OpenAI that was running in some sort of extra-high mode, and I seem to remember something like $2,000 per task being cited, around the o3 big launch.
Okay.
Yeah, that was a preview of o3 in December of 2024. It was really the first...
You know, there's a great chart on the ARC Prize homepage now where you can actually see this data point so clearly. And like I mentioned before, one of our missions as the foundation is to try to be a useful public sense-finding tool.
Sure.
And I think when we first launched ARC 1 and 2, there was a very common critique, and it's understandable, you know: hey, these things look like toys; are they really economically useful? Are they going to lead to any real progress? And now, in hindsight, I actually think that's a pretty outdated view, because we have pretty strong evidence that ARC held quite strong predictive power for noticing really important moments. We only started seeing saturation on the V1 benchmark, and remember, V1 was like five years old, we only started seeing any amount of progress from LLMs on V1 once we got AI reasoning, which was a really critical innovation that I'd argue is as important as the original transformer innovation. And then a year later, this was, you know, four months ago now, with the November-December 2025 class of models, with GPT-5.2 and Opus 4.5, we again started to see saturation on ARC v2, and it precisely correlated with this agentic coding capability that emerged. Yeah.
And so I'm optimistic that ARC-AGI-3 will again be a very useful predictive tool for understanding when AI agents are capable of operating in more open-ended environments. Yeah. Right now, you need a lot of human handcrafting to get these intelligence systems to work in domains such as coding, right, with Claude Code and Codex.
And I basically expect that when you're doing very well on v3, which will mean, by the way, a 100% score, v3 at 100% means AI can beat all the games as efficiently as humans can on an action basis, that will lead to economically useful systems where agents are able to operate in more open environments that they haven't been specifically trained on. Mhm. I still remember, from ARC-AGI 1, you see these like 3x3 grids, and the first time I ever tried it, I tried it on my phone, and I think my phone was in some weird landscape mode or something, so it wasn't rendering correctly, and I was like...
you didn't even get all the data points.
Yeah. No, so normally you see the blocks, and then you see the blocks to the left and to the right, and I was like, wow, I'm cooked, like, the fact that other people can do this. But of course, once you load it on desktop, it's very usable. I want to continue down that path of the o3 extra-high. What are you seeing from the labs that put forth models to test on ARC-AGI v3, in terms of just steering the models? Because we talk about GPT 5.4, but that means a lot of different things these days. Was this in the max reasoning mode? Should I compare this to what I'm seeing in ChatGPT? I'm getting more and more dropdowns where I can go, oh, I can go Pro, and then I can go extended thinking mode.
Is it an off-the-shelf model, or are they able to come to you and say, hey, we want to actually marshal ten times the amount of compute for this particular challenge?
On our verification leaderboard, we have a new testing policy. It's actually something we did have with one and two, introduced after o3, where we limit runs to $10,000 per verification run.
Okay. This is somewhat of a practical consideration. Yeah. If we actually used, like, the full million-token context window of the most expensive model, I think testing the full V3 private data set would be like $100,000, which is just kind of silly, right? So we set a reasonable limit; humans need nowhere near as many dollars to produce the same performance. I like that too, because that's the thing with getting AGI: if it's, yes, it can do anything, but it costs $50 million per prompt to do one hour of human labor, that's not really economically valuable, and so bounding it...
I think you want to know about progress, right? And I think $10,000 is a reasonable amount of money at which you'll actually see some degree of progress, and that will be a useful signal to start paying attention more.
Yeah. And, you know, for practical reasons, we just can't; we're a bootstrapped nonprofit, so we have to be thoughtful about our money and how we deploy it.
Yeah, that's where, so I think high reasoning mode is the most we used for the official verification results we've published today. I mean, do you spend a lot of time thinking about your own AGI timelines? Has your work at ARC shifted your timelines at all, or do you feel like, I've always been a 2035 guy, I'm still a 2035 guy, something like that? Do you have an internal model of this, or is that even useful these days?
Instead of listening to my predictions, you should probably follow our actions as the best sign of our view of progress. I think the reality is we have made tremendous progress with AI reasoning over the last 12 months.
Yeah.
ARC is operating to bring out the next version of the benchmark. We've already started work on V4, and we actually have plans written down for V5 as well. Our intention is to bring these to market annually over the next two years. And so that's sort of our expectation: having the next version ready right now.
Yeah.
Now, will we actually launch them? I think we'll have to see where frontier progress is. We want the future benchmarks to be as useful as possible. So if there's still a lot of utility and scientific value in the current version of the benchmarks, we want to keep the focus on those. But to the extent that the scientific value is starting to wane, we want to have the next version ready, having identified, hey, are there other interesting remaining large gaps between what humans can do and what AI can do, in order to drive that gap to zero. You know, again, we're a very AGI-pilled organization. We want to see progress; we actually love seeing progress. And part of our goal is to inspire as much progress as quickly as we can, to get to these AGI systems. Yeah.
So I'd say, yeah, that's sort of the operating view. A common question would be, well, is V3 AGI? Is V4 AGI? Is V5 AGI? No. The honest answer, and this is something I've actually learned, I had a different view of this maybe three years ago, is that no single version of any benchmark is ever going to be AGI. The frontier of progress is a moving target, and our job is to understand the remaining gap. The definition of that gap is going to change as time goes forward, and our job is to keep chunking up the largest pieces of that gap that we can find that are interesting, that identify some missing, important capability that humans have, and to produce benchmarks that showcase that gap.
Last question, and then I'll let you go. What's going on with the Pokemon bench? That feels somewhat related, similar tasks. What are you learning from that? How are models becoming so good at it? It feels like they aren't specifically RL'd on Pokemon, and yet they're learning; but also, there's a massive amount of written text about what to do at every level in Pokemon. Are they just learning that from the pre-training corpus? What's your thesis on Pokemon?
It certainly seems helpful. If I use our experience developing ARC as a sense-finding tool on this: we have seen more understanding from the latest generation of AI reasoning systems over the last three months than we saw in the first six months when we were developing ARC v3. I think you can split the research problem of agents into two things. One half is: can an AI agent effectively perceive some kind of environment state, apply a strategy that's written down to produce actions, and successfully execute a plan? That's half the equation. The other half is: can you have agents that are effectively able to develop what that plan is? To do that, you need to be able to build, on the fly, a world model of your task, acquire goals, create your strategy, create your plan. We've seen a lot more progress on the perception-through-strategy-to-action problem than we've seen on the exploration problem, the strategy-generation problem. And I actually think this is one of the areas I would point interested ARC 3 researchers at, because it's a lot more greenfield and will unlock a lot more progress, even on things like Pokemon bench, where it's kind of coming down to, okay, we know they can execute; the exploration and planning step is still where a large degree of bottlenecking is happening today.
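As a rough sketch of that two-halves split, the code below separates "develop the plan" (exploration and strategy proposal) from "execute the plan" (perception, strategy, action). Every name here, the environment interface, develop_plan, execute_plan, is a hypothetical illustration under assumed APIs, not anything published by ARC Prize or the labs.

```python
# Hypothetical sketch: the agent problem split into plan development vs. plan execution.
from typing import Callable, Protocol

class Environment(Protocol):
    def observe(self) -> str: ...
    def act(self, action: str) -> bool: ...    # returns True when the game is won

def develop_plan(env: Environment, propose_strategy: Callable[[list[str]], str],
                 exploration_budget: int = 20) -> str:
    """Half one: explore, gather observations, and write down a strategy."""
    observations = []
    for _ in range(exploration_budget):
        observations.append(env.observe())
        env.act("explore")                     # probe with a cheap default action
    return propose_strategy(observations)      # e.g. an LLM summarizing the rules

def execute_plan(env: Environment, strategy: str,
                 choose_action: Callable[[str, str], str], max_steps: int = 200) -> bool:
    """Half two: given a written strategy, perceive and act until a win or timeout."""
    for _ in range(max_steps):
        if env.act(choose_action(strategy, env.observe())):
            return True
    return False
```

The claim in the conversation is that current systems handle something like execute_plan reasonably well, while the develop_plan half, exploration, world modeling, goal acquisition, is where the bottleneck still sits.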
Well congratulations on the progress. Uh where can people find it? How can people participate? How can people help out?
Yeah. Go to arcprize.org. You can play the games as humans. Like I said, we've got almost 25 of them, I think, on the site, and they're all designed to be very fun. We explicitly controlled for this, actually, when we were doing human baseline testing. So it actually should be fun; you can have fun. And you can also get details there on entering ARC Prize 2026, our new $2 million prize pool this year, which covers ARC 2 and ARC 3.
That's amazing. Yeah, our teammate Tyler Cosgrove was climbing the human leaderboard for a while. I imagine he's been knocked off, but we'll have to get him back on top. Thank you so much for taking the time to come chat with me. This was fantastic. We'll talk to you soon.
Have a good one.