News

OpenAI and Google DeepMind both hit IMO gold-medal math performance — sparking AGI debate

Jul 21, 2025

Key Points

  • OpenAI and Google DeepMind both achieved gold-medal-level IMO performance, but neither meets the official criteria for an AI gold medal, which requires open-source release and independent verification.
  • OpenAI's model reasons for hours using only natural language and learned compressed notation, while DeepMind relied more heavily on formal proof verification—both approaches work but neither is proven to generalize beyond math.
  • The systems are not yet productized or economical at scale; frontier labs are optimizing for capability over cost, with reasoning performance costing thousands per problem and requiring infrastructure that doesn't yet exist in practice.

Summary

OpenAI and Google DeepMind both achieved International Math Olympiad gold-medal-level performance on the same benchmark, sparking immediate debate about what the milestone actually means and whether either lab can claim a genuine win.

OpenAI announced first, posting results at 12:50 a.m. on July 19th. DeepMind had achieved gold-medal performance on Friday afternoon but waited until Monday to announce, citing respect for the IMO board's request that labs hold results until independent verification was complete and human students received proper credit. The timing split creates two separate stories: one about capability, one about communications.

Technical paths

OpenAI and DeepMind took different architectural approaches to similar results. OpenAI relied on text-based LLM reasoning with reinforcement learning on non-verifiable rewards, letting the model reason through problems in natural language for hours at test time. DeepMind leaned harder on Lean, a formal proof assistant that mechanically checks each step, to scaffold the reasoning process. Both approaches worked. Neither is obviously dominant yet, and the question of which scales better or generalizes further remains open.
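To make "formal proof verification" concrete: in a proof assistant like Lean, every step of an argument must type-check against the kernel, so an incorrect proof simply fails to compile. The toy theorem below is a hand-written illustration of that style, not code from either lab's system:

```lean
-- Toy Lean 4 theorem: the sum of two even numbers is even.
-- Each tactic step is machine-checked; a flawed proof is rejected.
theorem even_add_even (a b : Nat)
    (ha : ∃ k, a = 2 * k) (hb : ∃ m, b = 2 * m) :
    ∃ n, a + b = 2 * n := by
  cases ha with
  | intro k hk =>
    cases hb with
    | intro m hm =>
      -- Witness: n = k + m, since 2*k + 2*m = 2*(k + m).
      exact ⟨k + m, by rw [hk, hm, Nat.mul_add]⟩
```

This is the guarantee DeepMind's scaffolding buys: soundness comes from the checker, not from trusting the model's prose.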

Gold medal level performance

DeepMind's Demis Hassabis framed it as achieving "gold medal level performance" with 35 out of 42 points, solving five of six problems. OpenAI's framing was similar. But the IMO organization has stricter criteria for officially awarding an AI a gold medal: the model must be open-source and released in a particular way. Neither lab has met those bars yet. Both cleared the performance threshold but neither has technically won the medal.

Terence Tao's lengthy analysis made the nuance concrete. He walked through the standard IMO format—six high school students, pen and paper only, four and a half hours per day, no external tools—and then listed every way that format could be altered to make the task easier: longer time limits, rewritten problem statements, access to calculators or proof assistants, team collaboration, selective submission, and the ability to withdraw without reporting participation. Some of those advantages do not appear to have been taken here. Tao's broader point landed hard: without controlled methodology and independent verification, apples-to-apples comparison between AI systems and human performance is inherently fraught.

Reasoning at scale

OpenAI's model thinks for hours to solve a single problem, compared to seconds or minutes for prior systems. The company framed this as developing "a new technique that makes LLMs a lot better at hard to verify tasks." The model uses only natural language—no Python, no Lean, no web search. Daniel Litt, an AI researcher, called this surprising, since most observers expected systems would need formal verification tools or code execution to reach this level.

The model's solutions are written in compressed, non-standard notation with abbreviations, sentence fragments, and invented shorthand, apparently to reduce token count and improve efficiency. It is effectively learning its own language to optimize inference.
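The efficiency claim is easy to sanity-check in miniature. The sketch below compares a verbose proof sentence against an invented shorthand version, using a crude whitespace split as a stand-in for tokenization (real LLM tokenizers such as BPE behave differently, and the shorthand strings here are illustrative, not from the model's actual output):

```python
# Crude proxy: count whitespace-separated chunks as "tokens".
# Real tokenizers split differently, but the direction holds:
# denser notation means fewer tokens per reasoning step.
def rough_token_count(text: str) -> int:
    return len(text.split())

verbose = ("Therefore, by the inductive hypothesis, the statement "
           "holds for n, and so it also holds for n plus one.")
compressed = "IH => P(n); => P(n+1)."

v, c = rough_token_count(verbose), rough_token_count(compressed)
print(f"verbose: {v} chunks, compressed: {c} chunks")
print(f"compression ratio ~{v / c:.1f}x")
```

Fewer tokens per step means more reasoning steps fit in the same context window and inference budget, which is the plausible incentive behind the invented shorthand.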

Cost and optimization

Gary Marcus opened with skepticism, calling it "almost cute" that an "insanely expensive new form of LLMs can now match top high school students on one specific task." Will Depue pushed back directly: "Stop using expensive as a disqualifier. Capability per dollar will drop 100x a year." He noted that the $3k-per-task performance in early test runs could probably cost $30 if optimization were the priority. Frontier labs are not optimizing for cost right now. They are optimizing for capability.
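The 100x-a-year claim is simple compounding arithmetic. A quick sketch of what it does to the article's $3,000-per-problem starting figure (the multi-year horizon is an assumption for illustration, not a forecast from either source):

```python
# Project cost per IMO-style problem under a claimed
# 100x-per-year improvement in capability per dollar.
def projected_cost(start_cost: float, factor_per_year: float, years: int) -> float:
    return start_cost / (factor_per_year ** years)

start = 3000.0  # ~$3k per task in early test runs, per the article
for years in range(3):
    print(f"year {years}: ${projected_cost(start, 100, years):,.2f}")
# year 0: $3,000.00 -> year 1: $30.00 -> year 2: $0.30
```

Whether the 100x rate holds is the real dispute; the arithmetic itself explains why "expensive today" is a weak disqualifier if it does.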

Generalization

The most important technical claim is that this is not an IMO-specific model. It is a general-purpose reasoning LLM using new experimental techniques. That distinction matters because prior AI breakthroughs in narrow domains like Go, poker, and Diplomacy often failed to transfer. If this model can actually reason at IMO level and then do something else useful—code, science, whatever—the capability profile changes entirely. Neither lab has published evidence on that yet.

Moving the goalposts

A familiar pattern emerges: achieve a benchmark, declare progress or AGI-adjacency, watch observers move the bar. François Chollet noted that intelligence is not a collection of skills but the efficiency with which you acquire and deploy new ones. Sebastian Raschka called it a potential moon-landing moment. Run suggested his bar for AGI is an AI that can learn to run a gas station for a year without a curated dataset. Thomas Wolf wants a Nobel Prize for a novel theory. The posts drifted progressively farther from IMO gold.

Tyler Cowen stood out for consistency: he said the Turing Test was his bar for AGI, we've passed it, so technically AGI has arrived. Most people aren't so much moving the goalposts as redefining what matters.

Productization timeline

None of this is immediately productized or economical at scale. The infrastructure to turn frontier reasoning into everyday tools doesn't exist yet. The UI, the cost economics, and the integration into existing workflows all require time. This is a slow takeoff scenario, not a sudden inflection. The IMO gold is a milestone, not a shipment.