Commentary

Grok 3 launches: state-of-the-art model benchmarks but lags OpenAI on product

Feb 18, 2025

Key Points

  • xAI's Grok 3 tops standard AI benchmarks across math, coding, and reasoning tasks but underperforms OpenAI's deep research product, outputting roughly 1,000 words versus ChatGPT's 5,000-word analyses.
  • Grok 3's personalization feature pulls from user Twitter history without coherent integration, creating surveillance-like behavior that lacks clear purpose or refinement.
  • xAI proved infrastructure speed and scaling — building a 200,000-GPU cluster in 214 days and reaching frontier-model performance — but remains unproven on whether benchmark gains translate to products people actually use.

Summary

Grok 3 tops standard ML benchmarks but falls short of OpenAI's product execution, even as xAI demonstrates unprecedented infrastructure speed.

xAI's Grok 3 hit top-tier performance on standard reasoning and coding benchmarks Monday, beating Google's Gemini, DeepSeek V3, Anthropic's Claude, and OpenAI's GPT-4o across math, science, and coding tasks. The model completed pretraining in early January using 20,000 H100 GPUs, a subset of xAI's 200,000-GPU Memphis cluster built in 214 days. Musk announced during a livestream that xAI is building a 1-million-GPU cluster. The company trained Grok 3 on 10 times the compute of its predecessor.

Benchmarks matter less than product here. Andrej Karpathy, who worked at OpenAI and has credibility across camps, rates Grok 3 around "state of the art territory," roughly o1-pro capability and "ahead of DeepSeek r1," but found its DeepSearch offering "approximately around Perplexity's deep research offering, which is great, but not at the level of OpenAI's recently released deep research."

The gap is real. In direct comparison, OpenAI's deep research outputs run about 5,000 words; Grok's DeepSearch runs roughly 1,000. When tested with identical prompts about LLM evolution and scaling laws, both models hallucinated elements of their responses, though ChatGPT's analysis felt more structured and reliable. Grok's reasoning model performed well on specific tasks, such as building a Settlers of Catan hex grid in HTML and solving tic-tac-toe boards, but failed Karpathy's emoji cipher test, which required decoding a message hidden inside an emoji via invisible Unicode characters (a minimal sketch of the encoding appears below). The most telling observation: Grok 3's humor remained poor, a common LLM failure mode. When asked for a structured stand-up joke, it generated: "Why did the chicken join a band? Because it had the drumsticks and wanted to be a Kluck star."

Grok 3 also ships with personalization: it draws on user timelines, tailoring responses to individual post history. This produces echo chambers by design. Grok 3 told Elon Musk that The Information is "garbage" and told another user the same outlet is "solid," pulling context from their respective Twitter feeds. OpenAI's product advantage comes not from raw model quality but from time spent on product refinement, a vector xAI has not yet pursued.
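
For reference on the cipher test: the likely mechanism, following Paul Butler's widely circulated "Smuggling arbitrary data through an emoji" write-up, maps each byte to one of Unicode's 256 variation selectors, which render invisibly after a base character. A minimal Python sketch, assuming Karpathy's test used this scheme:

```python
# Minimal sketch of the variation-selector trick: each byte becomes one of
# Unicode's 256 variation selectors, which display as nothing on screen.
VS_LOW, VS_HIGH = 0xFE00, 0xE0100  # VS1-16 block and VS17-256 block

def encode(base: str, data: bytes) -> str:
    out = base
    for b in data:
        out += chr(VS_LOW + b) if b < 16 else chr(VS_HIGH + b - 16)
    return out

def decode(text: str) -> bytes:
    result = bytearray()
    for ch in text:
        cp = ord(ch)
        if VS_LOW <= cp < VS_LOW + 16:
            result.append(cp - VS_LOW)
        elif VS_HIGH <= cp < VS_HIGH + 240:
            result.append(cp - VS_HIGH + 16)
    return bytes(result)

hidden = encode("😊", b"hello")
print(decode(hidden))  # b'hello', yet the emoji looks unchanged when rendered
```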

Why the timeline personalization matters, and why it matters less than you'd think. Grok 3's personalization based on user posts could theoretically help niche audiences make rapid, in-group requests without verbose prompts; a venture capitalist steeped in fund-dynamics jargon could query it in shorthand. But testing showed the model botched the integration. It embedded personal references (Patrick Mahomes, the Dakar Rally, Apple Watches, product managers) that appeared in the user's recent tweets without threading them into coherent comedic logic. One speaker texted xAI engineers and was told the feature could be "really cool for certain prompts." The same feature, left unrefined, reads as creepy surveillance without purpose.
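
xAI has not documented how this personalization is wired; the behavior described above is consistent with simple retrieval of recent posts into the prompt context rather than true per-user fine-tuning. A hypothetical sketch under that assumption (every name here is invented, not an xAI API):

```python
# Hypothetical: personalization as prompt-context injection, not fine-tuning.
def personalized_prompt(user_query: str, recent_posts: list[str],
                        max_posts: int = 20) -> str:
    # Inlining the user's latest posts lets the model mirror their vocabulary
    # and interests. Unrefined, this also explains the echo-chamber answers
    # and the jokes stuffed with surface-level personal references.
    context = "\n".join(f"- {post}" for post in recent_posts[:max_posts])
    return (
        "Recent posts by this user:\n"
        f"{context}\n\n"
        f"Answer in terms familiar to this user: {user_query}"
    )
```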

Diminishing returns are already visible. Scaling laws show performance gains flattening: GPT-3 to GPT-4 required a roughly 60x compute jump for about a 50% MMLU score gain, and a 100x compute increase yields a 25 to 60% loss reduction, not proportional gains. Grok 3's 10x improvement over its predecessor, mapped against a likely 10^26 FLOPs of training compute (though Grok and ChatGPT gave conflicting numbers when asked), suggests compute is still driving progress. Whether a 5x or 10x compute increase in future training runs yields 1%, 10%, or 50% capability gains remains unknown, and the answer matters enormously for venture returns. Karpathy noted the scaling timeline is "unprecedented" (from incorporation to competitive SOTA in under 19 months) but also cautioned that speed on benchmarks does not guarantee product-market fit.
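
A quick way to see the flattening: if loss follows a power law in compute, L(C) ∝ C^(-α), a 100x compute jump cuts loss by 1 − 100^(-α). The exponents below are illustrative round numbers, not fitted values, but they bracket the 25 to 60% range cited above:

```python
def loss_reduction(compute_multiplier: float, alpha: float) -> float:
    """Fractional loss drop under a power law L(C) = a * C**(-alpha)."""
    return 1.0 - compute_multiplier ** (-alpha)

# Illustrative exponents only; published scaling-law fits vary by setup.
for alpha in (0.05, 0.10, 0.20):
    print(f"alpha={alpha:.2f}: 100x compute -> "
          f"{loss_reduction(100, alpha):.0%} loss reduction")
# alpha=0.05 -> ~21%, alpha=0.10 -> ~37%, alpha=0.20 -> ~60%
```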

OpenAI's distribution moat is real; xAI's is constrained. Karpathy uses Grok 2 "quite a bit almost entirely via its integration into the x app" but reaches for ChatGPT elsewhere. "Product matters, and ChatGPT remains my product of choice," he wrote. OpenAI holds the only organic distribution channel built by an AI lab: consumer adoption before enterprise lock-in. xAI's model plugs into X, whose audience skews toward tech founders and finance bros, not mass adoption. For OpenAI to maintain its edge, it needs advertising products to monetize free-tier users and fund better tooling.

Musk's $97.4 billion hostile bid for OpenAI's nonprofit makes more sense now. The bid arrived weeks after OpenAI announced Stargate, a $500 billion infrastructure play. Musk's lawyers called the bid an attempt to "save the company from the dangerous direction" Altman has taken it. More directly, Musk told allies and investors, via his lawyers, that the move will ensure OpenAI "returns to the open-source, safety-focused force for good it once was." This is likely cover. Musk spent six months cementing ties to Trump, raised $6 billion for xAI last November, and controls his compute infrastructure outright. OpenAI is mid-conversion to a for-profit structure and depends on capital from Microsoft, Oracle, and SoftBank. Buying the nonprofit, which holds the controlling stake in the for-profit, would let Musk redirect governance and slow capital deployment. Altman's response: "Probably his whole life is from a position of insecurity... I don't think he's a happy person."

What xAI proved and what it didn't. Infrastructure speed and engineering discipline at scale (building a 200,000-GPU cluster in 214 days using repurposed factories, Tesla Megapacks, and improvised cooling) is a genuine edge. Scaling laws held; more compute yielded better models. The company hired top talent, moved fast, and landed benchmarks within striking distance of frontier labs. What remains unproven is whether benchmark wins translate to products people use, whether personalization without purpose becomes friction, and whether closing the gap between "smartest AI on earth" at math and "actually useful AI for knowledge work" requires an entirely different optimization surface. OpenAI took four-plus years to move from GPT-3 to a consumer-ready reasoning model. xAI has done it faster. Whether that matters depends on whether the market pays for intelligence or usefulness.