Grok 3 vs. OpenAI Deep Research: a head-to-head prompt breakdown
Feb 18, 2025
Key Points
- xAI's Grok 3 claims state-of-the-art performance on math, coding, and science benchmarks after training on 20,000 GPUs, but both Grok 3 and OpenAI's Deep Research hallucinate on basic facts about each other's compute intensity.
- Grok 3 personalizes responses based on user posting history, creating algorithmic echo chambers where users unknowingly receive answers confirming existing beliefs rather than challenging them.
- Despite matching or exceeding OpenAI's reasoning benchmarks, Grok 3 trails in practical usefulness: OpenAI's Deep Research produces longer, more methodical analysis, and users still default to ChatGPT because product distribution matters more than raw model strength.
Summary
xAI's Grok 3 arrived this week with benchmark claims that put it in state-of-the-art territory across math, science, and coding, topping Google Gemini, DeepSeek v3, Anthropic Claude, and GPT-4o. The model completed pretraining in early January, trained on 20,000 GPUs with 10 times the compute of Grok 2. Elon Musk announced the release during a live stream with three xAI engineers, emphasizing continuous daily improvements to the model.
The narrative around Grok 3 is simultaneously surprising and unsurprising. The unsurprising part: xAI built a 200,000-GPU cluster in Memphis in just 214 days, converting an old Electrolux factory into a data center with Tesla Megapacks smoothing the power draw. Scaling laws predicted this trajectory. The surprising part: xAI reached this point just 19 months after incorporation, faster than OpenAI's journey from GPT-4 to anything approaching GPT-5, even accounting for substantial interim updates.
Where Grok 3 stumbles
When tested against the same deep research prompt about LLM evolution, benchmarking, and the interaction of Moore's law and the bitter lesson, both Grok 3 and OpenAI's Deep Research produced tables and analysis, but the character of their outputs diverged. ChatGPT claims Grok 3 is an e26 model (on the order of 10^26 training FLOPs, more compute intensive than GPT-4), while Grok 3 itself claims to be e25 (less intensive than GPT-4). The discrepancy reveals both models hallucinating on basic facts about each other; neither provides definitive compute numbers.
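To make the size of that disagreement concrete, here is a minimal sketch of what the "e25" vs. "e26" shorthand implies. The GPT-4 figure below is a commonly cited external estimate, not an official number, and the two Grok 3 figures are the models' own conflicting, unverified claims:

```python
# "eN" shorthand: total training compute on the order of 10**N FLOPs.
# GPT4_FLOPS is a widely circulated estimate (~2e25), used only as a
# reference point; none of these numbers is officially confirmed.
GPT4_FLOPS = 2e25

claims = {
    "ChatGPT's claim for Grok 3": 1e26,   # "e26": more compute than GPT-4
    "Grok 3's claim about itself": 1e25,  # "e25": less compute than GPT-4
}

for source, flops in claims.items():
    ratio = flops / GPT4_FLOPS
    print(f"{source}: {flops:.0e} FLOPs -> {ratio:.1f}x GPT-4's compute")
```

The two claims differ by a full order of magnitude, which is why neither can be waved off as a rounding disagreement.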
OpenAI's Deep Research proved more readable. Its analysis is longer (typically 5,000 words versus Grok's roughly one thousand), more methodical, and better structured as an actual research paper. On specific evals, Andrej Karpathy reports that Grok 3 with thinking performs in "around state of the art territory" alongside OpenAI's o1 Pro, slightly ahead of DeepSeek r1 and Gemini 2.0 Flash Thinking. But on reasoning tasks such as decoding hidden Unicode messages in emoji and solving certain logic puzzles, OpenAI's models still edge ahead. Humor remains a notable gap across the industry. When asked for a stand-up joke, Grok 3 delivered: "Why did the chicken join a band? Because it had the drumsticks and wanted to be a Kluck star." The output has form but no punchline.
The personalization problem
Grok 3 appears trained on individual user timelines. When Elon Musk asked the model what it thought of The Information, it called the outlet garbage. When Paul Calcraft asked the same question, Grok replied that The Information is a solid outlet. The model tailors responses based on what it infers about the user's beliefs and interests from their posting history. This creates a different kind of risk: not censorship, but algorithmic echo chambers baked into the model itself. Users will unknowingly receive answers that confirm their existing views rather than challenge them.
Benchmarking and the moat question
The comparison also surfaces a real tension in how foundation models are evaluated. xAI omitted o3 from Grok 3's benchmark charts during the livestream, on the grounds that o3 remains private and unverifiable. But Grok 3 with reasoning is also not yet publicly available, so users cannot independently validate those results either. The selectivity cuts both ways, but the asymmetry matters for credibility.
Beyond raw capability, the strategic story is product-level distribution. Karpathy uses Grok 2 primarily through its X integration, yet continues to prefer ChatGPT as his default tool, noting that "product matters" more than underlying model strength. OpenAI built organic distribution through a consumer app; xAI bolted its model onto an existing platform. Whether X's audience (tech-forward, finance-engaged, often contrarian) maps to broader consumer demand remains an open question.
The broader compute picture
Both transcripts reveal a subtle point: scaling laws show diminishing returns. A 60x jump in compute from GPT-3 to GPT-4 yielded roughly a 50-point improvement on MMLU. Going from GPT-4 to Grok 3 likely follows the same curve. The question is whether incremental gains, say 5 to 10 percentage points on benchmarks, translate to qualitative leaps in usefulness. The transcripts suggest skepticism. Grok 3 beat o3-mini-high on reasoning benchmarks, but the practical gap feels narrow. Both models still hallucinate. Both struggle with certain logic tasks. Neither is yet a reliable substitute for human judgment on real work.
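The diminishing-returns point can be made concrete with a toy model. Assume, purely for illustration, that benchmark gains scale with the log of the compute multiplier, calibrated to the article's figure of a 60x jump yielding about 50 MMLU points; this is a sketch, not a fitted law, and it ignores that benchmarks saturate near their ceiling:

```python
import math

# Toy log-linear scaling model, calibrated so that a 60x compute jump
# (GPT-3 -> GPT-4) maps to ~50 percentage points on MMLU.
GAIN_PER_LOG10 = 50 / math.log10(60)   # ~28 points per 10x compute

for multiplier in (10, 60, 100):
    gain = GAIN_PER_LOG10 * math.log10(multiplier)
    print(f"{multiplier:>3}x compute -> ~{gain:.0f} point gain (toy model)")
```

Even this optimistic curve gives a 10x jump (Grok 2 to Grok 3 territory) only about half the GPT-3-to-GPT-4 gain, and real benchmarks compress further as scores approach 100%, consistent with the 5-to-10-point framing above.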
xAI is catching up on raw model capabilities in an unprecedented timeframe. The product, distribution, and consumer experience tell a different story. Grok 3 is strong. It is not yet the clear winner in how people actually use AI.