ARC Prize launches V3 interactive benchmark and $10K agent contest to challenge frontier AI systems
Jul 18, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Mike Knoop
Well, we have Mike from ARC AGI coming on the stream. It's ARC Day. I love these releases. Thanks so much for always coming on the show. Of course.
Thank you guys for being such huge supporters. It means a lot to this project. It's amazing. So we have the latest and greatest.
Why don't you give us a little bit of background? First of all, I think we should talk about the xAI launch last week, because that was a pretty massive improvement. So maybe start there, and then we can get to the present. Yeah.
So this is really, I think, the second time we've seen a major frontier AI lab use ARC as a benchmark to show off some sort of frontier of progress. Right? Back in December, we had OpenAI use ARC V1 to show off this qualitative change, right?
It really marked the moment where frontier AI research moved beyond just scaling up pre-training and started adding these symbolic systems on top, these chain-of-thought reasoning systems. ARC really marked that moment.
And then last week the xAI team used ARC V2 to show off a frontier result. They got 16% on the benchmark. Still early days.
But I think one of the really big takeaways from that was basically how quickly and effectively xAI was able to catch up to the frontier using a lot of the existing ideas in the world.
And so my mental model going forward is: wherever the frontier innovation comes from, my expectation is xAI is probably going to go beat for beat on catching up, given how fast they were able to get there this time around.
One thread I've heard from watching the ARC progress is: you put more test-time inference, more spend, behind a particular model, and you get better results. And it begs the question: is this brute-forcing? Is that what we're experiencing at some level?
Is that a fear? And is the latest ARC V3 an attempt to avoid that, or is that just not an issue at all? So, no, because of how we evaluate top scores on ARC. We publish along two dimensions. One is an accuracy score.
But we also publish an efficiency measure, and this is not arbitrary. Efficiency is a really, really fundamental aspect of what it means to be intelligent.
So we have the million-dollar prize, which is still hanging out there, by the way, for the original version of the contest. And if you look at the human-level efficiency scores just on V1, we're still only around 60-ish percentage points, whereas our benchmark for humans is 85 and onwards.
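The two-axis reporting described here, accuracy alongside an efficiency measure, can be sketched in a few lines. This is purely illustrative: the field names, the use of dollar cost per task as the efficiency axis, and the numbers are invented for the example, not ARC Prize's actual leaderboard schema.

```python
# Toy two-axis report: accuracy alongside cost-per-task efficiency.
# Illustrative only -- field names and figures are invented, not
# ARC Prize's actual scoring pipeline.

def report(results):
    """results: list of (solved: bool, cost_usd: float), one entry per task."""
    accuracy = sum(1 for solved, _ in results if solved) / len(results)
    avg_cost = sum(cost for _, cost in results) / len(results)
    return {"accuracy": accuracy, "avg_cost_per_task_usd": avg_cost}

# Two hypothetical submissions with the same accuracy but very different spend:
cheap = report([(True, 0.10), (True, 0.20), (False, 0.10), (True, 0.20)])
brute = report([(True, 9.00), (True, 12.0), (False, 8.00), (True, 11.0)])
print(cheap)  # same accuracy as `brute`, far lower cost per task
print(brute)
```

The point of the second axis is visible immediately: a single accuracy number would rank these two submissions as equal, while the efficiency column separates them.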
So, even though we've got teams at 83, 85% or something, would they win the million, or do they have to get to 100%? The Kaggle contest rules are a little more stringent.
You have to do it within a certain performance profile, a limited compute budget; you have to get the high score; and then there's an open-source requirement.
So this is one of the principles of the ARC Prize Foundation: we're trying to accelerate progress toward AGI by encouraging more people to work on new ideas and openly share those ideas, to try to shape the AI research community to look more like it did during the 2010-to-2020 era than it maybe has over the past three or four years.
What about... we talked to somebody, I forget who, and they said: yeah, ARC-AGI is cool, but we could totally crush that if we just RL'd on it directly. All the labs are just being nice to keep it as an independent benchmark.
But I don't fully buy that, because I feel like some hedge fund would see a million dollars on the floor and just go do that if that were the case. So what is the vibe? The reality is that benchmarks are marketing. Yeah. Right.
This is one of the things I didn't appreciate 12 months ago when we first launched ARC Prize. I do now. The whole reason I put the money into the contest in the first place was that ARC's awareness was very low, and my thesis was that this is the most important unbeaten benchmark in the world.
It tells us something important that no other benchmark does. Right? If you go look at even the Grok 4 stream, all the other benchmarks that were shown are PhD++ level benchmarks. And yet for ARC tasks, we have objective evidence,
we've done controlled human studies, that they're all solvable by average humans. And so I think that tells you something interesting: okay, we're clearly missing something big here. We don't have it all figured out. We're not just in a scaling-up regime.
We're in an idea-constrained regime. And I think that's an important conclusion, because if it's true, it means that individual researchers and small teams on small budgets can actually have a significant impact on the frontier of AI research.
You don't need a million, 10 million, billion-dollar training budgets in order to make a material impact on AI. Yeah.
The real challenge is just the distortion in the market right now, and the trade-off a researcher faces. Even somebody who's, let's say, a 20-year-old in college who could be doing this sort of independent research is thinking: wait, if I drop out, I could be making $500,000 a year base comp, plus the equity on top of that.
So I imagine at some point, are there high schoolers doing this kind of thing? Because at a certain point the target market is people who are maybe a little too young to actually get hired. Silicon Valley companies will happily have somebody drop out of college.
It's a different conversation when somebody says, I want to drop out of high school, right? I mean, this informs my thesis. I think talent is very distributed globally.
If you look at most of the teams that have been on the leaderboard from past years of the ARC contest, a large percentage, I'm not sure if it's over 50%, but it might be, are outside the United States.
And again, if you're trying to create an optimization function for creating AGI, you would like to shape an innovation environment that's very open, where there's a lot of sharing and a lot of diversity of approach.
The opposite would be: there's very little sharing, and everyone's working on the same ideas. Because if those ideas are wrong, then you're shooting yourself in the foot.
And so that's one of the reasons we launched ARC Prize in the first place: to try to help communicate the story that individual people, young folks with very little budget, can actually have a large impact, and to encourage them, if they have new ideas towards AGI, to go work on those, maybe as opposed to just starting the next language-model startup.
Yeah. When Dwarkesh came on the show, he was talking about the need to solve continual learning. This idea he has of an amnesiac PhD that is unable to learn hard lessons and then roll them up into habits and, kind of, wisdom almost.
And that's why he's unable to use any of the frontier models to, his example was, select which clips of the podcast will perform well on social media; he was struggling with that. And I'm wondering, we hear about the spiky intelligence concept.
Do you think the problems that underlie ARC Prize's robustness are related to the same continual-learning problem Dwarkesh was highlighting, or are these two separate problems, where we could see us solving one and not the other? I think that's happening, right?
The scores are much higher on V1 than V2. So if I lay out and tease here the version we're doing a public preview for today: we've got ARC V1, V2, and V3. V1 was introduced back in 2019. It was designed to challenge deep learning as a paradigm.
Remember, this is years before language models really hit any sort of stride in the research. And it was robust through that advancement, and that's because language models sort of inherit some of the same fundamental limitations that pure deep learning does.
V2 was designed to challenge this new paradigm of AI reasoning systems. It's still a static benchmark, so the puzzles look very similar to V1. It might actually be surprising that you can't beat V2 if you can beat V1, because they look like they're in-domain with each other.
But the intuition here is that the V2 puzzles generally require longer reasoning chains to solve. And that gets harder to do.
One of the things we started to see this year, though, is the emergence of a lot of these agent systems being placed into dynamic, open-ended environments. And while static reasoning benchmarks are useful and will continue to be useful,
this is one of the motivations for starting to build V3 and defining what we're calling an interactive reasoning benchmark: to help evaluate and really challenge some of these frontier AI agent systems we're starting to see emerge. Okay. So, should we do the live demo?
I think we can. Yeah. So I can kind of read through this, too. Give us the overview, and then we'll play. It's a little unique for us, because you might be surprised: oh, wait, V3 is launching? Didn't V2 launch like three months ago?
So today is a public preview. We're showing off the first three public games from the eventual data set. We're building the rest this year, and we intend to launch the full version in early 2026.
We're going about the launch a little differently than we did with V1 and V2, because V3 is such a big upgrade over them. And by upgrade, you mean significantly more challenging?
The gap between what's easy for humans and hard for AI is getting wider again with V3 compared to the earlier versions, which is one of our other design principles. And it's also just a very different domain than V1 and V2.
They look like ARC tasks, but they're dynamic, and there's a lot we don't know about them quite yet. Both what humans can do, what they actually find easy, what AI agents can do, and how much you can build custom harnesses and scaffolds to maybe make progress. I think we're going to learn a lot.
And this is why we're launching the first three games early: to make contact with reality here and accelerate our learning over the next month or so.
On our game design and our API design: we actually launched our first piece of infrastructure today as well, an API that you can use to build agents and run them against these first three games.
And we launched a $10,000 agent contest that's running for the next month, for whoever can build the agent that gets the top score on the games. So even if they get 1%, if it's the top score, however low, the money is going out the door.
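An agent against this kind of interactive API boils down to a perceive-act loop. The sketch below separates the policy from the transport so it runs without the real service; the action vocabulary and the `step_fn` contract are assumptions for illustration, not the published ARC-AGI-3 API, so check the official docs before building on it.

```python
import random

# Assumed action vocabulary for a grid game like Locksmith -- not the
# official ARC-AGI-3 action set.
ACTIONS = ["up", "down", "left", "right", "noop"]

def choose_action(frame, rng):
    """Placeholder policy: ignore the observed frame, act uniformly at random.
    A real contest agent would replace this with something that models the game."""
    return rng.choice(ACTIONS)

def run_episode(step_fn, rng, max_actions=200):
    """Drive one episode. `step_fn(action) -> (frame, done)` abstracts the call
    to the game server; swapping in a real API client is the only change needed."""
    frame, done, taken = None, False, 0
    while not done and taken < max_actions:
        frame, done = step_fn(choose_action(frame, rng))
        taken += 1
    return taken

# Smoke test against a fake environment that ends after 5 actions:
count = {"n": 0}
def fake_step(action):
    count["n"] += 1
    return ([], count["n"] >= 5)

print(run_episode(fake_step, random.Random(0)))  # 5
```

Keeping the policy behind a single function is also what makes the public/private split bite: the harness can be reused, but a policy overfit to the three public games carries nothing over to the hidden ones.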
One important thing: like V1 and V2, ARC V3 has a public and a private data set. So we've got three public games, and there are also three private games. Those are actually what we're going to award on, top-score performance on the private games.
So if you're thinking, oh, I'll just make a really good harness for the three public games, it won't work, because it won't translate to the three hidden games. Smart. Clever. I love it. So, I'm sharing my screen to the stream.
I don't know if you'll be able to see it, but I will read through this. My suggestion is you guys should play Locksmith. This is LS20, our first game. I think you guys should just collaborate on this one game together. It'll take about 5 to 10 minutes to play through.
And I think seeing both of you work together on it live will be fun and entertaining. You should be able to see the screen. Okay, I'll read it out. Human instructions: you are playing a game. There are no instructions, intentionally.
You must play the game to discover the controls, rules, and goal. Press start to play. Choose your controls: WASD or arrow keys. Play to learn the rules of the game. Win the game. Profit. Just kidding, no prizes here. So I click start, and I'm presented with a large grid of squares.
It looks like the most intense ARC-AGI puzzle possible, because the original ARC-AGI puzzles were something like a 3x3 grid, and now I'm seeing they're all 64x64. 64x64, okay. So I do have the ability to use the arrows to move this blue and orange block. And Jordy, can you see me moving it around? Yes. Okay.
So if I go on top of this, the bottom-left-hand corner updated slightly. Let me see if I can zoom out a little bit. All right, you've learned one important thing. And if I keep moving... clicking seems to do nothing. Spacebar seems to do nothing.
I'm going to give a little commentary while you're going here, John. So, the three games in the data set that we launched today are all completely different. None of them look similar.
In fact, this is the only game in the set that looks like a 2D agent game from a top-down camera view. All the other ones are quite different.
And this is actually a design goal of the benchmark: for all the eventual hundreds of games to be entirely different, very novel and diverse from each other. So, I appear to have lost a life. If I look in the top right, I had three red dots.
Now I have two. Reset. I got a red flash, and so I think I died. But if I move forward, I get a green circle, which I imagine means I won. And now I'm on a new level. Do you know why you got to the new level? Can you articulate that yet? Yes.
I believe I stepped on a button that rotated, or kind of rearranged, the goal icon in the bottom left of the screen.
At first I thought the little blue and white line in the bottom left was a map of the path I have to take, but it appears to be a puzzle shape that I have to match up with a puzzle shape that's on the grid somewhere. And I'm stepping on a button to change it.
So when I step on this rotate-or-change button, I get kind of a different Tetris piece. And if I keep doing this, I might land on... okay, that matches now. So the bottom left matches the little goal. You see the goal? I go over to it. But if I go over to it, I die.
And so I think I ran out of purple steps. So I have a set of... there we go. Purple, like, energy. I have energy. Yeah. So now I'm going to do the same thing. But this is another very important design goal
that many of the games in ARC V3 inherit, which is that there are intentional efficiency limits on the actions you can take.
So now, green score. I did it. Which is smart, because the idea is: Scott Wu can probably one-shot a math problem, and it's very efficient for him, and then someone else might be able to solve the same problem but over weeks. Is that really the same level of intelligence?
For a long time, actually, I think games were considered a solved problem in AI, with AlphaGo and all the chess engines. And one of the reasons was that the only thing stopping those systems from being totally superhuman was that they hadn't scaled the then-current algorithms enough. Most of them used RL, so they're trying to take a reward signal and understand which actions produced it.
This is one of the things efficiency helps with: it limits an agent's ability to naively gather a reward signal by spamming, by playing the games hundreds of thousands of times. That's something humans don't need to do, right?
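The efficiency limit described here, a hard per-episode cap that rules out brute-force reward spamming by construction, can be sketched as a thin wrapper. The class and method names below are illustrative, not ARC's actual harness.

```python
# Sketch of an efficiency limit: an episode-level action budget that makes
# "play it a hundred thousand times" strategies impossible by construction.
# Names are illustrative, not ARC's actual harness.

class BudgetExceeded(Exception):
    """Raised when the agent tries to act past its allowance."""

class ActionBudget:
    def __init__(self, limit):
        self.limit = limit
        self.used = 0

    def spend(self, n=1):
        """Charge n actions; refuse once the episode's allowance is gone."""
        if self.used + n > self.limit:
            raise BudgetExceeded(f"budget of {self.limit} actions exhausted")
        self.used += n

    def remaining(self):
        return self.limit - self.used

budget = ActionBudget(limit=3)
budget.spend()
budget.spend()
print(budget.remaining())  # 1
```

A harness would charge this budget on every `step` call, so an RL-style agent gets the same small number of tries a human player does.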
You know, you already beat level two in less than 5 minutes here with a very limited, efficient number of actions. And this is something we don't see from the frontier LLMs or the other agents we've been testing. Okay, I got to my next level.
All right, the next level started introducing some new concepts. So I'm seeing different colors. I have blue and white, and I need blue and orange. So I'm going to step on this color block. Now I have blue and teal. Now I have blue and red.
I'm doing okay on efficiency; I have about half my life. Okay, that seems good. But I think if I make a run for it, I won't make it. So I'm going to pick up this purple cube to refresh my energy. Run over here. Am I going to make it? I made it. There we go. Victory.
Okay, now there's a different one. I'll start by changing the color. Okay, I nailed the color: blue and teal. That's the end goal, and the black cube. Then I need to switch my icon from this one Tetris piece to a different one. Okay, I got that. No, that's not it. I need the little chair-shaped block.
Okay, that's... You're running out of lives, John. Roughly correct. Let's see. Okay, I refreshed. Go over. I think this other block is a button. I think I'm going to run out of lives, though. I need to go refresh. I refreshed just in time. I'm one away. Okay, let's keep rotating.
Keep rotating. Okay, now it matches. There we go. And I'll just pick up the free energy. Boom. Green. You might have noticed the scaffolding: new things you have to learn, right? There's a progression system here. It's not just learn one rule in level one and apply it for the entire game.
One design goal we found really matters is that all the games are fun. Yes. And one of the things we found when we were doing early game design was that folks did not find the games fun if they just took one rule they learned and repeated it. Right.
So introducing new things you have to continually learn throughout the game is a big factor in whether humans actually find these things entertaining and fun. Well, we should figure out the infrastructure to have pure PvP speedruns of the entire prize. The whole team.
Real quick while we have you, we have our next guest in the waiting room, but I wanted to ask you, because it is top of mind for us this week.
I think the broader tech community went from not taking AI safety and alignment super seriously, or kind of making jokes about it, until this week. I think AI psychosis has been top of mind for a lot of people.
If you were running one of these scaled labs today, how would you try to react quickly to some of the stories that seem to be bubbling up around people getting too deep into this sort of recursive prompting?
You know, my view generally on safety stuff is that you want to be empirical about it.
This was my big issue when we were going through all the SB 1047 legislation last year in California: it tried to anticipate what future harm might happen by predicting the future, and in some cases predicting it poorly.
I think even ARC was a clear demonstration of an eval suggesting we were not just scaling up pre-training. AGI is not just going to emerge from scaling up this pre-training regime.
And that was sort of the predicate thesis for why we needed something like SB 1047 at the time: to stop this imminent, urgent, potentially dangerous scaling. So now, my view probably aligns quite closely with OpenAI's.
I think you actually need to deploy the technology into the world in order to make contact with reality and learn what the actual issues are that you care about, and that society cares about.
My view is that society is actually better at dealing with fast change than slow change. This is another counterpoint to what I think the safety community would argue, that slow takeoff is better than fast takeoff.
And I think for a lot of capabilities, fast takeoff is perhaps desirable, because humans notice change. This is literally what we evolved to do: in our environment, we notice when things change fast. We don't notice when things change slowly, right?
The frog boiling in the pot is the classic example here. And so I think fast change, stories like this, issues with psychosis, are good in a weird way, because they are sort of societal antibodies: oh, hey, something changed here, we should react to it. Yep.
No, that's kind of my broad framework for how I think we should run these things. In some ways, the negative externalities of social media, let's say somebody developing body dysmorphia, crept in because the change was very slow: the spectrum moved from sending group messages to suddenly sharing your entire life. With that kind of slow creep, you wake up 10 years later and think, hm, are we happy with where we got to?
Brain rot is a meme that took years to develop. Fascinating. You have to come back on soon. We could go way deeper. We're going to be playing this all day. Hope you guys had fun playing the first game. Fantastic. I was not expecting a game. I was expecting a puzzle, and we are