News

Google launches Gemini 3, doubling state-of-the-art on ARC-AGI v2 benchmark

Nov 18, 2025

Key Points

  • Google's Gemini 3 doubles the prior state of the art on ARC-AGI v2, with the Deepthink variant reaching 45% against the previous lead held by GPT-5.1 Thinking, closing perceived gaps with OpenAI and Anthropic on frontier models.
  • Gemini 3's visual understanding and computer-use capabilities shift from weak to functional, enabling practical agentic AI features like interactive web page generation that could unlock reliable autonomous agent behavior.
  • Google launches Antigravity, an agent-first IDE that treats code generation as reaching 90% completion and requiring human refinement, positioning the company to capture value from model improvements across downstream products.

Summary

Google launched Gemini 3, a foundation model that doubles state-of-the-art performance on the ARC-AGI v2 benchmark, a puzzle-solving test historically difficult for AI but easy for humans. Gemini 3 Pro achieves 31% on ARC-AGI v2, while the Gemini 3 Deepthink preview reaches 45% at $77 per task. Both significantly outpace GPT-5.1 Thinking, which held the previous benchmark lead.

Matt Shumer claims the performance jump rivals the magnitude of GPT-4's March 2023 release. On ARC-AGI v2's fastest tasks, Gemini 3 Pro solves problems in 188 seconds, approaching the 147-second human average while maintaining competitive accuracy. The model also posts gains on reasoning benchmarks such as Humanity's Last Exam (37.5%) and beats GPT-5.1 and Claude Sonnet 4.5 on numerous benchmarks.

Beyond raw scores, practical capability improvements emerge. Visual understanding and computer use—navigating websites and interfaces—have shifted from "really really bad" to "reasonably solid," suggesting genuine progress toward agentic AI rather than marginal gains. The model generates interactive web pages as shareable URLs, a feature that could function as a growth loop. In simulated vending-machine business scenarios, Gemini 3 Pro outearned GPT-5.1, Claude Sonnet 4.5, and Gemini 2.5 Pro combined.
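
For readers who want to try the page-generation feature programmatically, here is a minimal sketch using the google-genai Python SDK. The "gemini-3-pro-preview" model ID and the prompt are assumptions; the shareable-URL hosting is a feature of the Gemini app itself, so the sketch simply writes the generated page to a local file.

```python
# Minimal sketch, assuming the google-genai SDK and a "gemini-3-pro-preview"
# model ID. Gemini's shareable-URL hosting is a product feature not exposed
# here; we just save the generated page locally.
from google import genai

client = genai.Client()  # reads GEMINI_API_KEY from the environment

response = client.models.generate_content(
    model="gemini-3-pro-preview",  # assumed ID for Gemini 3 Pro
    contents=(
        "Generate one self-contained HTML file: an interactive bar chart "
        "of monthly revenue with hover tooltips and no external dependencies."
    ),
)

# The model's reply is expected to be the HTML source itself.
with open("page.html", "w") as f:
    f.write(response.text)
```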

Andrej Karpathy, who tested early access, endorsed Gemini 3 as "clearly a tier one LLM" with "very solid daily driver potential" across personality, writing, vibe coding, and humor. He cautioned that public benchmarks risk overfitting through elaborate optimization of test-adjacent data. Usable quality matters more than benchmark scores alone.

Observers remain split on significance. One characterization: excellent but incremental, "newer, better, smarter, faster, stronger" without binary capability leaps like ChatGPT's natural conversation or Cursor's and Devin's code-writing. Another view argues that improved computer use and visual reasoning could finally unlock reliable agentic AI, which would be transformative.

Google is also entering the IDE market with Antigravity, an agent-first development environment that treats code generation as getting 90% of the way there, with human refinement supplying the final 10%. Developers can annotate and comment directly on generated code and UI mockups, creating a collaborative loop between agent and human, as sketched below. This design differs from traditional sidebar AI helpers.
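
As a rough illustration of that loop, consider the sketch below. The Annotation structure and prompt format are invented for illustration; they are not Antigravity's actual interface, only a way to show how line-level comments can feed back into the agent's next pass.

```python
# Hypothetical sketch of an annotate-and-refine loop; the Annotation type and
# prompt layout are illustrative inventions, not Antigravity's real interface.
from dataclasses import dataclass

@dataclass
class Annotation:
    line: int      # line of the draft the reviewer's note targets
    comment: str   # the human reviewer's note

def build_refinement_prompt(draft: str, notes: list[Annotation]) -> str:
    """Fold human annotations back into a prompt for the agent's next pass."""
    rendered = "\n".join(f"Line {n.line}: {n.comment}" for n in notes)
    return (
        "Revise the draft below, applying every reviewer note.\n\n"
        f"--- draft ---\n{draft}\n"
        f"--- notes ---\n{rendered}\n"
    )

# The agent's first pass covers roughly 90%; the human notes drive the rest.
draft = "def total(xs):\n    return sum(xs)\n"
notes = [Annotation(line=1, comment="Validate that xs is non-empty first.")]
print(build_refinement_prompt(draft, notes))  # sent back to the model
```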

Google had faced a perception of lagging OpenAI and Anthropic on frontier models despite significant internal AI and infrastructure assets. Gemini 3's benchmarks suggest the gap has closed on the foundation-model side. Whether this lead persists depends on OpenAI's next release, whose benchmarks remain unconfirmed, and on how quickly downstream products capture value from model improvements.

The industry cycles through releases where each lab's newest model becomes the temporary benchmark winner until the next launch arrives. This pattern complicates strategic assessment. Sundar Pichai's shift from "AI AI AI" messaging to letting the model demonstrate itself marks a tactical change, though whether Gemini 3's performance translates to consumer or enterprise adoption remains an open question.