Inception Labs' diffusion LLM hits 1,000+ tokens/sec on standard Nvidia GPUs, targeting latency-sensitive coding and voice agent use cases
Feb 24, 2026 with Stefano Ermon
Key Points
- Inception Labs' Mercury 2 achieves 1,000+ tokens per second on standard Nvidia GPUs by using diffusion-based parallel text generation instead of traditional left-to-right token prediction.
- The speed advantage is algorithmic and hardware-agnostic, scaling linearly across GPUs via AWS Bedrock and other cloud providers rather than requiring custom silicon.
- Inception targets latency-sensitive applications like coding IDEs and voice agents, where reasoning-model quality must fit real-time latency budgets, with parallel inference as its structural moat against replication.
Summary
Inception Labs announced Mercury 2, a diffusion-based reasoning model that achieves over 1,000 tokens per second on standard Nvidia GPUs including Hopper and Blackwell hardware. The model applies diffusion's parallel-generation approach to text and code instead of the left-to-right token prediction used by traditional autoregressive LLMs.
Diffusion models start with a rough output and iteratively refine it. Inception adapted this to text by allowing the neural network to modify many tokens simultaneously rather than one at a time. Co-founder Stefano Ermon describes the process as "coarse to fine generation"—beginning with a blurry guess at what the answer should be and progressively sharpening it, with the neural network learning this refinement pattern rather than following an interpretable sequence of steps.
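The coarse-to-fine loop can be sketched in a few lines. The toy below is not Inception's model: the "denoiser" here just proposes random tokens with random confidences, standing in for the neural network's parallel predictions. What it does illustrate is the schedule Ermon describes: start from a fully masked ("blurry") sequence and commit a batch of positions per step, refining many tokens simultaneously instead of one at a time.

```python
import random

MASK = "_"

def toy_denoise_step(tokens, vocab, rng):
    """Propose a (token, confidence) pair for every masked position in parallel.

    Stand-in for one neural-network forward pass; a real diffusion LM would
    produce these proposals from bidirectional context, not at random.
    """
    return {i: (rng.choice(vocab), rng.random())
            for i, t in enumerate(tokens) if t == MASK}

def coarse_to_fine_generate(length, vocab, steps, seed=0):
    """Start fully masked and commit the most confident batch each step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    per_step = max(1, length // steps)  # positions fixed ("sharpened") per step
    while MASK in tokens:
        proposals = toy_denoise_step(tokens, vocab, rng)
        # Commit the highest-confidence proposals in parallel; revisit the rest.
        best = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in best[:per_step]:
            tokens[i] = tok
    return tokens

out = coarse_to_fine_generate(16, list("abcd"), steps=4)
```

The key structural point is that each loop iteration touches many positions at once, so the number of sequential model calls is the (small, fixed) step count rather than the sequence length.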
The speed gain is algorithmic rather than hardware-dependent. Inception's diffusion approach is compute-bound rather than memory-bound, allowing it to hit the theoretical ceiling of the GPU's floating-point capacity. This differs from custom-silicon approaches like Chatjimmy's 16,000 tokens per second on proprietary hardware. Inception can scale linearly by adding more GPUs and runs on general-purpose infrastructure available through AWS Bedrock and other cloud providers.
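A back-of-the-envelope roofline calculation shows why moving from memory-bound to compute-bound generation matters. The numbers below are generic assumptions (a 7B-parameter model in bf16 and commonly cited H100-class figures), not Inception's measurements: batch-1 autoregressive decoding must stream every weight from HBM per token, while generating many tokens per forward pass amortizes those weight reads until arithmetic throughput becomes the ceiling.

```python
PARAMS = 7e9          # assumed model size (parameters)
BYTES_PER_PARAM = 2   # bf16 weights
HBM_BW = 3.35e12      # bytes/s, approximate H100 SXM HBM3 bandwidth
PEAK_FLOPS = 989e12   # FLOP/s, approximate H100 SXM dense bf16 throughput

# Memory-bound: one token per pass means reading all weights per token.
ar_tokens_per_sec = HBM_BW / (PARAMS * BYTES_PER_PARAM)

# Compute-bound: many tokens per pass; roughly 2 FLOPs per parameter per token.
parallel_tokens_per_sec = PEAK_FLOPS / (2 * PARAMS)

print(f"memory-bound ceiling:  ~{ar_tokens_per_sec:,.0f} tokens/s")
print(f"compute-bound ceiling: ~{parallel_tokens_per_sec:,.0f} tokens/s")
```

Under these assumptions the memory-bound ceiling sits in the hundreds of tokens per second while the compute-bound ceiling is orders of magnitude higher, which is the gap a parallel-generation algorithm can exploit on the same GPU.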
Target use cases are latency-sensitive applications including coding IDEs for autocomplete, edit suggestions, and refactoring; voice agents that need fast response times; and retrieval applications like query rewriting, reranking, and summarization. Mercury 2's reasoning capability addresses a specific tension in voice agent development: customers need the quality of a reasoning model but cannot tolerate slow inference. Inception's approach trades sequential token prediction for parallel refinement, delivering reasoning-level quality with latency budgets suitable for real-time interaction.
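The voice-agent tension reduces to simple latency arithmetic. The token count and the slower baseline throughput below are illustrative assumptions, not figures from the article; only the 1,000 tokens/sec rate comes from the announcement.

```python
# Rough per-turn latency for a voice agent (illustrative numbers only).
reasoning_tokens = 400  # assumed reasoning trace + spoken reply for one turn

for name, tps in [("sequential decode @ 60 tok/s", 60),
                  ("parallel decode @ 1000 tok/s", 1000)]:
    print(f"{name}: {reasoning_tokens / tps:.2f} s per turn")
```

At the assumed slower rate the turn takes several seconds, which is an awkward pause in conversation; at 1,000 tokens/sec the same reasoning trace fits comfortably inside a sub-second real-time budget.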
Ermon acknowledges distillation risk. Recent work by Chinese teams shows models can be effectively distilled from API access using surprisingly few data points. He frames this as a scientific fact of the field rather than a defensive problem for Inception, noting that once API access exists, replication becomes inevitable.
Comparing benchmark performance between diffusion and autoregressive LLMs, Inception's models perform well on coding and editing tasks, partly because diffusion can condition on context in both directions rather than only on tokens to the left. Ermon attributes some of this advantage to training data choices, as the team prioritized coding workloads. Speed itself is the defensible moat: the parallel inference pattern is structurally harder to replicate than benchmark performance alone.
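The bidirectional-context advantage can be pictured with attention masks. This is a schematic (plain boolean masks, not any particular model's implementation): under a causal mask a token being edited in the middle of a file sees only itself and what came before, while a diffusion-style bidirectional mask lets it also see the code that follows the edit.

```python
n = 6
# mask[i][j] is True when token i may attend to token j.
causal = [[j <= i for j in range(n)] for i in range(n)]  # left-to-right only
bidirectional = [[True] * n for _ in range(n)]           # both directions

i = 2  # a token being rewritten mid-sequence
visible_causal = sum(causal[i])        # itself plus tokens to its left
visible_bidi = sum(bidirectional[i])   # the entire surrounding context
print(visible_causal, visible_bidi)
```

For edit and refactoring tasks this matters directly: the correct rewrite of line `i` often depends on code below it, which a left-to-right model cannot see at that position.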