Interview

Mike Knoop on Gemini 3's ARC-AGI results: impressive V2 gains, but V1 'obvious mistakes' remain a mystery

Nov 18, 2025 with Mike Knoop

Key Points

  • Gemini 3 achieves roughly 2x improvement on ARC-AGI V2, making Google the third major lab after OpenAI and xAI to use the benchmark for public reasoning progress.
  • Despite V2 gains, Gemini 3 fails on dozens of V1 tasks with obvious mistakes, revealing a jagged capability profile that contradicts assumptions about AI reasoning scaling.
  • Refinement loops, where language models iteratively update programs toward solutions, are outperforming test-time fine-tuning as the dominant architecture for frontier reasoning systems.

Summary

Gemini 3 posted roughly a 2x score improvement on ARC-AGI V2 over prior frontier results, making Google the third major lab, after OpenAI in December 2024 and xAI in summer 2025, to use ARC-AGI to publicly demonstrate frontier reasoning progress. The result was independently verified by ARC Prize, which had verified a Gemini 2.5 result earlier in the year.

The V1 Anomaly

The headline number masks a persistent puzzle. Despite the V2 gains, Gemini 3 remains roughly within the existing Pareto frontier on ARC-AGI V1, still failing on dozens of tasks with what Mike Knoop, co-founder of ARC Prize, describes as obvious mistakes that humans resolve quickly. Knoop had previously assumed that any system capable of solving 50% of V2 would fully saturate V1 — that assumption has not held.

His working hypothesis is that AI reasoning systems demonstrate fluid intelligence only in domains where the base model has strong training coverage and a verifiable feedback signal. Outside those conditions, self-consistency breaks down: systems articulate correct rules internally, then fail to execute against them. Knoop is openly calling for community investigation into why this jagged capability profile persists.

Benchmark Integrity

Knoop dismisses the possibility of selective benchmark optimisation, i.e. tuning for V2 while ignoring V1, on logical grounds: any team motivated to game V2 would also target V1, and Google's conduct in the broader research community gives no indication of benchmark hacking.

ARC-AGI V3 and the Prize Timeline

  • V3 is approximately two-thirds through development as of mid-November 2025.
  • ARC Prize 2025 official results are scheduled for December 5, 2025.
  • The full V3 dataset is targeted for public release in Q1 2026, alongside the ARC Prize 2026 announcement.
  • V3 introduces interactivity — goal acquisition, exploration, and action efficiency comparisons between humans and AI — dimensions absent from V1 and V2.
  • Early V3 game design is showing a widening gap between human ease and AI difficulty, reversing a trend seen during V2 development where some tasks were pruned for being too hard for humans or too easy for AI.

Knoop notes that ARC Prize's design objective is usefulness, not longevity for its own sake. V1 proved useful for roughly five years; at launch, V2's median estimated lifespan was 24 months.

What Is Actually Working

The highest-performing approaches at ARC Prize 2025 centre on refinement loops — language models placed in outer loops that iteratively update a program or natural language task description toward a solution. This is meaningfully outperforming the test-time fine-tuning approach that dominated ARC Prize 2024, where competitors took the private puzzle, generated augmented permutations, and applied a LoRA fine-tune on a pre-trained model at inference time.
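The refinement-loop pattern described above can be sketched in a few lines. This is a hedged illustration, not any competitor's actual pipeline: `propose` stands in for the language-model call (here a stub that cycles through hypothetical candidate transformations), and the "programs" are toy grid functions rather than real ARC solvers.

```python
# Minimal sketch of a refinement loop for ARC-style tasks.
# An outer loop proposes a candidate program, verifies it against the
# training pairs, and feeds a score back to the proposer.

def run_program(program, grid):
    """Execute a candidate program, treating any exception as failure."""
    try:
        return program(grid)
    except Exception:
        return None

def score(program, train_pairs):
    """Verifiable signal: fraction of training pairs reproduced exactly."""
    hits = sum(run_program(program, x) == y for x, y in train_pairs)
    return hits / len(train_pairs)

def refine(train_pairs, propose, max_iters=10):
    """Outer refinement loop: propose, verify, feed back, repeat."""
    feedback, best, best_score = None, None, -1.0
    for _ in range(max_iters):
        program = propose(feedback)
        s = score(program, train_pairs)
        if s > best_score:
            best, best_score = program, s
        if s == 1.0:  # verified against all training pairs
            return best
        feedback = f"candidate matched {s:.0%} of training pairs"
    return best

# Toy task: the output grid is the input grid flipped horizontally.
train = [([[1, 2], [3, 4]], [[2, 1], [4, 3]]),
         ([[5, 6]], [[6, 5]])]

# Stub "LM": a fixed sequence of candidate programs (hypothetical).
candidates = iter([
    lambda g: g,                         # identity
    lambda g: g[::-1],                   # vertical flip
    lambda g: [row[::-1] for row in g],  # horizontal flip
])
solved = refine(train, lambda fb: next(candidates))
print(solved([[7, 8, 9]]))  # → [[9, 8, 7]]
```

The key contrast with 2024-style test-time fine-tuning is that nothing here updates model weights; all adaptation happens in the outer loop, with the training pairs acting as the verifier.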

Knoop frames the last 12 months of headline AI results — IMO gold, ICPC performance, Gemini 3 — as products of the same underlying architecture: a language model base augmented with symbolic program synthesis methods. He argues this hybrid search space remains significantly underexplored and represents the most productive direction for new teams with limited compute budgets.

Macro View on Automation and Adoption

Knoop draws a clear line between what current reasoning systems can already automate and what remains out of reach. Any problem that can be characterised by generating synthetic training examples and extracting a verifiable feedback signal is, in his view, automatable today without further breakthroughs. Problems requiring novel idea generation — what he describes as closer to an AI-complete problem — are not.
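Knoop's automatability criterion has two concrete ingredients, which a minimal sketch makes explicit: (1) a synthetic example generator and (2) a verifiable feedback signal. Sorting is used here purely as a toy instance, and the "policy" argument is a placeholder for whatever model is being trained, not anything from the interview.

```python
# Hedged sketch of the automatability criterion: a task qualifies if you
# can generate synthetic examples and verify candidate outputs cheaply.
import random

def make_example(rng, n=8):
    """(1) Synthetic example generator: random integer lists."""
    return [rng.randrange(100) for _ in range(n)]

def verify(inp, out):
    """(2) Verifiable feedback: output is sorted and a permutation of input."""
    return out == sorted(out) and sorted(out) == sorted(inp)

def reward(policy, rng, trials=100):
    """Training signal: fraction of synthetic examples the policy solves."""
    return sum(verify(x, policy(x))
               for x in (make_example(rng) for _ in range(trials))) / trials

rng = random.Random(0)
print(reward(sorted, rng))  # a perfect policy scores 1.0
```

Problems requiring novel idea generation fail at step (2): there is no cheap `verify` to close the loop, which is why Knoop places them outside what is automatable today.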

He pushes back on narratives that frame the last 12 months as incremental. The AI reasoning paradigm is only 12 months old, and 2025 has largely been about figuring out how to bring these systems into production workflows. Citing data referenced around OpenAI's GPT-5 launch, which introduced a model router, only roughly one in five users has ever interacted with a reasoning model; diffusion of the technology into actual workflows is still in early innings, a pattern Knoop says he sees directly in enterprise sales conversations at Zapier.

Open Research Question

A 100x to 300x efficiency improvement in AI reasoning systems over the past 12 months should, in theory, free up inference compute for broader search coverage over problem spaces. Knoop's open question is why these systems still appear to under-explore available solution search space — and what changes to training methodology would be required to guarantee fuller coverage. He considers this a critical prerequisite to fully solving V2.