Mike Knoop on Gemini 3's ARC-AGI results: impressive V2 gains, but V1 'obvious mistakes' remain a mystery

Nov 18, 2025 · Full transcript · This transcript is auto-generated and may contain errors.

Featuring Mike Knoop

Our first guest of the show, Mike Knoop from ARC-AGI, is in the Restream waiting room. Welcome to the show, Mike.

Good morning, guys. Thanks for waiting. Good morning.

Morning.

How are you doing? You know, a lot of these AI verification things are very much hurry up and wait. So the last 24 hours have been hurry-up mode.

Okay.

Always very fun and exciting to get the results out, but yeah, it always comes together very quickly at the end.

Well, I really appreciate you taking the time to hop on on such a busy day. Maybe we can just start with your high-level reaction. How do you even think about these things anymore? Are you just thinking, okay, yes, Gemini 3 good, and then going a layer deeper? What's your high-level takeaway?

Well, yeah, so you know, I think the big headline is that Gemini 3 basically got like 2x SOTA on ARC v2.

Yeah.

Um, and so this is the third major frontier lab now in a year to use ARC to demonstrate frontier progress, particularly with AI reasoning systems. We had OpenAI last December,

xAI this summer, and I'm super excited Google's now on the leaderboard too. So that's great, and I should say up front: thank you to the Gemini team for giving us the opportunity to verify. It's been great. Um, I'm still sitting with all this stuff, it's pretty fresh, but I think the most impressive thing to me is that we're starting to close this complexity scaling gap between ARC v1 and v2.

This is the big difference between v1 and v2: they look similar on paper, but if you go look at the data sets, the big change is that v2 increases the complexity of the tasks, ones that take humans minutes instead of seconds. And so we're starting to see actual material progress on that complexity scaling. And then the big surprise to me personally is that Gemini 3 is still roughly along the Pareto frontier of v1.

Yeah.

You know, it's a little better, but we're still roughly within the same shape. And there are dozens of tasks where the system still makes what I think are obvious mistakes, ones humans don't make, or recognize very quickly. I previously expected that if we had an AI system solving half of v2, v1 would be fully solved, and that's not the case. So there's a lot of surprise here. I was posting about this earlier to invite investigation from the community, because I think there's still a lot to learn about how and why exactly we're seeing such jagged intelligence emerge right now.

Let me eliminate some possible factors. It could be benchmark hacking, but Google and the Gemini team don't seem inclined toward benchmark hacking generally; they've been good citizens in the community so far. And also, just from logical deduction, you would assume that if you were able to hack v2, you would definitely go back and hack v1 as well. So is that

This isn't the first time we've verified a Gemini result this year, either. We verified 2.5 earlier as well. So yeah, I don't think that's

So it's not like they set up, okay, the most important thing here is that Gemini 3 is really good at ARC-AGI v2. That wouldn't make sense. So this is teaching us something about the fundamental nature of this model, but we still don't know why performance might be lagging on v1. Is that right?

Yeah, I mean, I've got my hypothesis. My personal one is that AI reasoning systems just don't demonstrate general fluid intelligence.

Um, you know, the ability of these reasoning systems to do adaptive reasoning, which ARC is a test of, adaptation capability, is limited to domains where the underlying foundation model has pretty good training coverage over the types of data and where there's a verifiable feedback signal.

Yeah.

Um, and I think that's true for ARC. If I zoom out even further, to put this result in the context of where we're at as an industry right now: over the last 10 years, I'd say we've really had only two major breakthroughs. We had the transformer in 2017, and obviously that led to language models. And we had chain of thought, originally introduced in 2022, which went through Q* into AI reasoning systems and has gotten scaled up.

Sure.

Um, and this was against the backdrop of compute scaling, right? That compute scaling was certainly necessary, but it wasn't sufficient; these key conceptual unlocks were the sufficient things that took advantage of that compute. And so my take at this point, having looked at all this progression this year, is that AI reasoning systems, with no new innovation from here, can basically enable mass automation, because a lot of problems fit the characterization where we can generate lots of examples that look like the problem and get a verifiable feedback signal from them. Any problem that can be cast and characterized in that way can be automated at this point, no questions asked. And then the big motivating factor, I think, really, for

mass innovation, that's what we're still not seeing. We still need new ideas for this, and I think that's closer to an AI-complete problem.

Yeah, that makes sense. Is it fair to put you in contrast to some of what Dwarkesh has been writing, saying that the job of most people is not necessarily a bunch of discretely verifiable tasks? Andrej Karpathy has been writing this as well. There's this question of how much of a job is actually automatable. Radiology was one example, where it felt like a very automatable job and yet, years into the deep learning revolution, we're still seeing full employment there. How are you processing that?

Yeah, but we're only a year into the AI reasoning paradigm, right? The first major one only came out 12 months ago, and 2025, in my view, is basically characterized by starting to figure out how to actually bring these things into production systems.

Sure.

Um, this is a big breakthrough. Maybe one of the mischaracterizations of the progress, in my view, is that a lot of teams assume, oh, models just get better, so the last 12 months have been a continuation of the same story, and if I played with the models 18 months ago, I have a rough sense of what they can and can't do. That's just not true. Yep.

Um, if you're a builder building products, this is the advice I give to teams I work with at Zapier too: this actually is a significant paradigm break in terms of what's possible now that wasn't possible even a year ago. That's going to enable a lot of new types of products and services; a lot of use cases that were out of scope because of reliability and consistency can now be brought into scope. So if your intuition about what use cases are possible is based on an 18-month look-back, you really have to start pinning your look-back to more like 12 months.

Yeah. Yeah, that makes sense. Does the work live within SaaS products or within individuals? Because some of the examples you just gave are for teams that are going to build products that automate work and then get vended in through, effectively, SaaS products to actually do the job, versus a knowledge worker who is going to be using Gemini in the app to accelerate their day-to-day. Should they be feeling the difference in the same way?

You know, my one bit of advice is: if you haven't really used these AI reasoning systems much, you should. I would hope everyone who listens to the show has used these things at this point, but in case you haven't, you should go use and experience them. You know, when OpenAI released GPT-5 this summer with their model router, right, that was

that was crazy

predicated on data showing that very few users had ever even used reasoning systems.

Um and I still think it's only like one in five. Yeah,

maybe it's

And that was kind of part of the DeepSeek moment: for the first time there was a free app where you could see a chain of thought, actually see a reasoning model in action, and for a lot of people that was their introduction to it. DeepSeek wasn't necessarily that far ahead of everything else, but it gave away a reasoning model for free at a time when they were tucked behind a bunch of hurdles you had to jump through.

Yeah, we're still really early on the diffusion of this stuff. You're seeing that in the huge numbers getting reported by frontier labs and their usage data. I mean, I'm seeing this in sales conversations I have for Zapier stuff all over the place. We're still very much in the early innings of actually getting this brand-new breakthrough into production workflows.

Yep. Yeah, that makes sense. Do you have more questions on the diffusion issue?

Yeah.

One, I wanted to get your updated take on humor. We were playing around with Gemini 3 this morning, specifically trying our own little version of HumorBench. It's something I do think about: can you make humor verifiable? Is there a system someone could set up that could actually start taking humor seriously? Because I could imagine, if we're hitting anything close to a wall, there will be a lab that says, okay, let's work on a new angle for differentiation, and maybe humor could be it.

At least a little bit, right? I have a 5-year-old who is starting to want to tell a lot of jokes, and the jokes are just terrible,

right? Like they're not they're not funny at all. They're they're like

you end up laughing because they're so not funny, and depending on who's delivering it, that is hilarious.

I've been trying to find a structured way to describe, okay, here's what makes something funny. And there is some degree to which you can break down the types of things humans would find funny. This actually gets pretty interesting, because you're getting to the spot where you're trying to articulate creativity, right? How creative can these systems be? To be creative, to do humor, to make good art, you kind of have to intentionally break the rules, but you need a really good model of what the rules are in the first place to break them intentionally. And in fact, I think a lot of humor fits into this category: it's about breaking the prediction rather than just following the prediction of what you'd expect. And today, when I look at the failure cases for AI reasoning systems on tasks like ARC, they still fail for what appear to be random reasons. They have some version of an understanding of the rules, the strategy, and the goals, and then they make a lot of basic mistakes, either in executing them or in not following the understanding they've generated internally. So there are some self-consistency issues, and if that's still the case, I feel like humor is going to be accidental rather than intentional from these systems.

Yeah. Yeah.

What about V3? We played around with that on the show. I believe Tyler, our intern, was in the top 10 for a while; he really ground his way up the human leaderboard. Is it more compute-intensive? Is that in process? Are we expecting to see Gemini benchmarked on V3?

I would love to. So we are in the development process for V3. I like to say we've basically built the most productive game studio in the world. Yeah, [laughter] we're generating hundreds of these things. We're about, I don't know, two-thirds of the way through building all the games at this point. Our target is to get this into a good state, with all of our controlled human studies done, all the games verified, and frontier results checked off, by early next year. We're targeting releasing it publicly, with the entire data set, in Q1 next year. And that'll likely be alongside ARC Prize 2026. Yeah.

um still working on full details of how that's going to look next year. Sure.

Um, but yeah, we're in the throes of it. We're definitely using some of these frontier systems to do red-teaming against the benchmark, just to assert that, yes, these games are still hard for AI, and we're still finding that to be the case even with things like Gemini 3. But yeah, we're still in the middle of development right now.

And SIMA 2. Can I have your reaction to that? Obviously it's this Gemini-based AI agent. It feels like

If anyone at Google is listening to this and could give me access to SIMA 2, I would love to test it on V3. This is actually something that we haven't done yet.

Yeah. Yeah. Yeah. That's what I'm getting at because it feels like uh I I I don't know if there's some sort of

The claims are big. You read the marketing material and it's like, okay, that seems like it should solve V3 before it even exists. So if that's the case,

we should know that. But yeah, I haven't gone hands-on with it yet, so I can't make any

statement either way on the claims.

Yeah. I'd also be interested, when I'm thinking about V4, it's like you guys are going to have to build GTA 6 or something. [laughter] Yeah, if I'm following the progress of V1, V2, V3, then V4 is a game that I'm going to play for 100 hours for fun. I'm just going to pay for it.

Yeah, you've hit on something true about V3, which is that it's still relatively short-time-horizon tasks, and they're self-contained. It does add some new complexity, because you have to deal with interactivity: you have to do goal acquisition, you have to do exploration. We'll have a really nice action-efficiency comparison between humans and AI, which we haven't been able to get before in the V1 and V2 domain. So we're going to get a lot of new signal from V3. But yeah, as you look even further out into the future, more open-ended things are what we're starting to get excited about: trying to understand what it means to put one of these AI systems in an open-ended environment, and then look at the system 10 minutes in the future, 200 minutes in the future, a thousand minutes in the future, and, from how it has manipulated the environment, say something interesting about how intelligent the system is in that open-ended sense. Still very early on V4, but yeah, we're starting to explore ideas there.

Has Gemini 3 updated your timelines at all? Specifically your ARC-AGI 2 timelines, in terms of when you expect, you know, 90%, anything on the upper end of the range.

I was looking back at predictions the whole ARC team made back in January, when we released V2, about what we expected end-of-year scores would look like. Now, obviously, it's only November 18th; a lot happens in AI, and who knows what the next six weeks hold. But my personal prediction was that we would see about 25% on the private leaderboard for ARC v2 in the Kaggle contest, and about 50% on the public leaderboard. That was based on the ratios we had seen from ARC Prize 2024 and the scaling difficulties with V2, and it looks like we're going to come in pretty close to that, barring some other major new breakthrough toward the end of the year. That seems like where we're probably going to end the year. And then who knows about 2026. [laughter] If we're really going to solve V2 fully, it feels like we've got to better understand why these AI reasoning systems still make obvious mistakes on the V1 set.

Um, and yeah, that's an anomaly. So I think it's worth serious study, to come up with new ideas to improve these reasoning systems.

Yeah. What was the furthest timeline that you had out? I remember when you developed V3, you had this framework of, like, the state-of-the-art should be scoring negative 100% or something; you need to make it way harder than you think in order to give yourself room to run, because the systems are developing so quickly. What's the furthest-out timeline that you, or you as a team, are tracking?

I mean, our objective function is not longevity, necessarily. It is usefulness and interestingness.

Um, I think the tasks with the highest degree of usefulness and interestingness are ones where, hey, this could be useful and interesting for, like, three years.

Mh.

Um, ARC v1 was useful and interesting for arguably five years. Even this year it's still interesting, because we haven't broken it; we're still within the same paradigm, so it's still providing useful signal. Even though it's largely saturated, up to 80% now, there's still interesting signal remaining. For V2, our expectation was that it was not going to survive as long as V1, just because it's the same domain and we had AI reasoning systems in play at that point. Yeah. Um, I think our median estimate was like 24 months on V2, but we'll have to see how that plays out next year. V3 we're hoping to put in an environment where we can actually get it to survive longer.

You know, one of the interesting things we're finding going from V1 to V2 to V3, in a qualitative sense, is a sense of how easy it is for us, as humans, to generate the data set: to design the tasks, the puzzles, and the games. With V1, pretty much every task that François created was hard for AI and easy for humans.

Yeah.

With V2, that gap got smaller. There were tasks we generated as humans that AI solved, and there were others that were too hard for humans, so we ended up pruning some of the tasks we generated. So the gap between those things got smaller. With V3, we're finding it's getting wider again,

where pretty much every game we're coming up with fits this paradigm of being very obvious, intuitive, and easy for humans, and still very hard for frontier AI.

Yeah. Um, and I think credit goes to François here. This is something he shared about a year ago, with o3: one interesting way you could characterize how close we are to AGI is, when humans run out of the ability to generate interesting things that your AI can't solve, it's hard to argue at that point that any expert is going to say, no, we don't have AGI.

Yeah. Because you can think of the project of humanity as: go do the hard and novel things. Is acquiring diamonds difficult? Okay, that has value, and then we base a whole economic system around it. It's somewhat arbitrary, but it's also a skill, and a might-and-will thing, and if you can put that on display, then you accrue economic value. And that traces out into everything we do, in life and beyond.

Last time you were on, if I remember correctly, you made a call for new ideas. What's the update on that front? Are you seeing anything promising outside of the LLM world?

Yeah, there's some pretty interesting stuff coming out of ARC Prize 2025. We're in the throes of reviewing all the papers and judging all the scores; the official results for ARC Prize 2025 come out on December 5th, I believe. So I can't share everything yet; I don't want to spoil the final announcement. I think one of the big things we saw from ARC Prize 2024 was this concept of test-time adaptation. This was the idea that a pre-trained model applied through a single forward pass at inference time will never solve ARC. You need some ability to take information from your test and incorporate it back into the system, and that's where your adaptation capability comes from. During the contest that was done through test-time fine-tuning. AI reasoning systems are a version of this, where you're incorporating the private test data

Test-time fine-tuning. Wow.

Yeah. Yeah, literally: you take a pre-trained model, then take the secret, private puzzle, augment it in a bunch of different ways to generate permutations of it, and then do a LoRA or some sort of test-time fine-tune on your pre-trained model. And that actually works.
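The augmentation step Mike describes can be sketched in a few lines. This is an illustrative toy, not the actual ARC Prize contest code: the function names (`rotate90`, `augment_grid`, `permute_colors`) are hypothetical, and real entries combine these kinds of geometric and color augmentations before running a LoRA fine-tune on the pre-trained model.

```python
# Hypothetical sketch of augmenting an ARC puzzle for test-time fine-tuning.
# An ARC grid is a list of lists of color indices (integers 0-9).

def rotate90(grid):
    """Rotate a grid 90 degrees clockwise."""
    return [list(row) for row in zip(*grid[::-1])]

def flip_h(grid):
    """Mirror a grid left-to-right."""
    return [row[::-1] for row in grid]

def permute_colors(grid, mapping):
    """Relabel colors via a dict mapping old -> new color index."""
    return [[mapping.get(c, c) for c in row] for row in grid]

def augment_grid(grid):
    """Generate the 8 dihedral symmetries of a grid (4 rotations x 2 flips)."""
    out = []
    g = grid
    for _ in range(4):
        out.append(g)          # current rotation
        out.append(flip_h(g))  # and its mirror image
        g = rotate90(g)
    return out

# A test-time fine-tune (e.g. LoRA) would then train on (augmented input,
# augmented output) pairs built from the hidden task's demonstration examples,
# applying the same augmentation to both sides of each pair.
```

Applying the same symmetry to both the input and output grid of a demonstration pair yields a new, equally valid training example, which is what makes this cheap data multiplication possible at test time.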

Wow.

The common ground between this and reasoning systems is that both of them take information from the private test and are able to operate with it at test time, right? Test-time compute is another form of what we're talking about here. That was 2024. One of the big things we're seeing in ARC Prize 2025 is this concept of refinement loops, particularly language models being put into outer loops where they move from state to state, and the way they move is by making some refinement to the program or the natural-language explanation of the task they're working toward, and they just iterate on this refinement loop over and over. And this is significantly increasing scores, even over the test-time fine-tuning stuff we saw last year. Jeremy Berman and Eric Pang were two folks on the public leaderboard last month who explained how their approaches worked in this way. So we're seeing a lot of approaches like that. I still think we're in a regime, though, where we need new ideas. None of these is sufficient to solve ARC, including v1. And that gets me excited, because I think it means individual people and individual teams with small compute budgets can still play a really massive role in advancing AI.
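The refinement-loop idea above can be illustrated with a toy sketch. In the real entries, a language model proposes and revises candidate programs; here the candidates are a fixed, hypothetical list of simple functions, purely to show the propose-verify-refine loop structure and the verifiable feedback signal (do all demonstration pairs pass?).

```python
# Toy sketch of a refinement loop: score candidate programs against a task's
# demonstration pairs and stop once one reproduces all of them.
# In real approaches an LLM generates each refinement; the fixed candidate
# list here is purely illustrative.

def score(program, examples):
    """Fraction of demonstration pairs the candidate program reproduces."""
    hits = sum(1 for x, y in examples if program(x) == y)
    return hits / len(examples)

def refinement_loop(candidates, examples):
    """Iterate over refinements, keeping the best-scoring program so far."""
    best, best_score = None, -1.0
    for program in candidates:      # an LLM would propose these on the fly
        s = score(program, examples)
        if s > best_score:
            best, best_score = program, s
        if best_score == 1.0:       # verifiable feedback: all demos solved
            break
    return best, best_score

# Demonstration pairs for a hidden rule ("double the input"):
examples = [(1, 2), (3, 6), (10, 20)]
candidates = [lambda x: x + 1, lambda x: x * 2, lambda x: x ** 2]
best, s = refinement_loop(candidates, examples)
```

The key property is that the demonstration pairs act as a built-in verifier, so each iteration gets an unambiguous signal about whether the current program is right, which is exactly the feedback structure Mike credits for the score gains.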

Yeah. Uh very cool.

Are there other areas where we're making progress in AI that might need to come together to solve this, or at least to make a more complete system? What I'm thinking of is: very few solvers that I'm aware of will actually take a screenshot of the puzzle and inspect it with some sort of diffusion model. That's not the way these AI models reason about ARC puzzles. We're also seeing a bunch of work on world models and world simulators, which seem really interesting. I was talking to one guy who is building one, and he was saying, I think we're going to get really robust knowledge out of these at some point once they scale up fully. I'm wondering if you're optimistic about unifying some of the different research that's happening.

With all of those examples of new research, new companies, new startups: there was a seismic shift in 2025 from pre-training budgets to these reinforcement learning environment startups, companies that are generating environments to produce

more ground-truth training data en masse, because they're automated environments and you can get verifiable feedback signals out of them. Yeah.

Um, again, there's no new science here; this is a good bet for all frontier labs to make. This is going to drive progress for the next 24 to 36 months. You're going to continue to see amazing frontier headlines just on this fact. There's really no new discovery that's needed there. But if you're pushing more toward the AGI side, what's missing? One open question I have is: you would think that, based on the 100x to 300x increase in efficiency we've seen from AI reasoning systems over the last 12 months, we would trade that efficiency for inference tokens to do

more search coverage over the problem space when we give these systems tasks or problems we want them to solve. Mhm.

And this is one of the big reasons I expected that if we could solve half of V2, we'd get 100% of V1. And it seems like these AI reasoning systems are not fully exploring all of the search space they could in order to look for solutions. So I have an open question: how much of the search space can they cover? And what do you need to change about the training methodology or process to actually guarantee full coverage over the search space of possible programs or possible solutions? That's one interesting thing I'm paying a lot of attention to right now.

Yeah. Even just the metaphor of test-time fine-tuning: it feels like working on a problem, then going and taking a walk and updating your whole world view. It feels like something humans do, closer than any of the other paradigms. So yeah, it's fascinating to see all these different approaches.

All the crazy results you've heard about in the last 12 months are kind of this merger of deep learning and symbolic, program-synthesis-style methods: the ICPC and IMO golds, the Gemini 3 stuff today. These are all systems that are still fundamentally using a language model, but they're adding symbolic knowledge-recomposition systems on top. They all work slightly differently. Okay.

um but it's like

This is what's working right now, and I think the search space of research into how you merge those two paradigms is still relatively underexplored. There are a lot of different ways you can put these two paradigms together.

Yeah.

Um, and you know, for new teams considering work on new ideas, I would explore: what are the novel ways you could merge these two spaces?

Yeah. Yeah, that makes a ton of sense. Uh Jordan, anything else?

This was great.

This is amazing. Thank you so much for jumping on on short notice. Uh and

As always, guys, thanks for having me on.

Thank you for the continued wins, just stacking up the wins on ARC-AGI, becoming

and just continuing to mog the models, mog the world.

Yes. [laughter] I mean, again, our goal is to be very useful and interesting. So we're going to try to hold that bar.

Mark my words, I think you're keeping them honest. I think you're keeping everyone honest. And you're providing a very, very useful reality check on an industry that loves to

inspiring the labs to grind harder

And now there is a moment where we can feel very confident about taking victory laps and cheering for all the hard work that went into Gemini 3, because it does seem like it was a great model. It's performed well. There's definitely a big improvement today.

Fantastic. Well, thank you so much. Have a great rest of your day. We'll talk to you soon.

December 5th. We'll see you then.

We'll see you then.

Um, I wanted to talk about Attio,

because Attio is an AI-native CRM that builds, scales, and grows your company to the next level. Also wanted to talk about wander.com. Book a Wander with inspiring views, hotel-grade amenities, dreamy beds, top-tier cleaning, and 24/7 concierge service.

Let's sing it.

Find your happy place. Find your happy place.

Book a wander with inspiring views. I already know the song. You know the song.

I wanted to pull up this post from Chris Pariski. He did a GitHub-style image of our streaming activity for the year. Oh, really? See this?

Oh, yes. I did see that.

Should be at the very bottom. It's at the very bottom of our timeline.

I have it.

Uh, and if we could just pull up this image.

So, the internet rewarded TBPN for showing up on January 28th. That's when we went live. We never remember the day we went live, but he has it; he looked it up. January 28th: John Coogan and Jordi Hays launched a daily live show and set one simple rule: show up 5 days a week. Looking back, they did exactly that. 125,000 followers on X, 41,000 subscribers on YouTube, 17 and a half thousand on Instagram. They showed up every day, and the internet rewarded the proof of work.

So the only thing is, am I just color-blind? I'm seeing three days that were federal holidays that we missed, and then three days that were

no streams. I can't exactly tell. Yeah. What is a federal holiday? What is a no-stream? It looks like maybe a gray and a purple. There were a couple days here and there. We took one off; I went to a wedding in Mexico, so we took a Friday off for that. That was just no live stream. July 4th we took off; that was a Friday, that was a federal holiday. And then what happened in March? We took a Wednesday off. No live stream on a Wednesday in the middle of March.

There was one day that we were traveling.

Oh yeah.

That was after um

after Hill and Valley, after DC. I thought it was a Thursday, though.

No, no, we did. We did Tuesday in the hotel room, and then Wednesday we did at the actual event, Hill and Valley, and then we flew back and got back on the horse. So we missed a couple Mondays because of federal holidays, and we missed a Tuesday in May; that might have been Hill and Valley. March might have been something else. Anyway, it's been a wild ride. Thank you to everyone pulling it together along the way. Our next guest is, I believe, already here: we have Jonathan Neman from Sweetgreen. We're going from benchmarks to bench presses. The most important benchmark in the world: how many grams of protein are in your protein bowl? We