Inception Labs' diffusion LLM hits 1,000+ tokens/sec on standard Nvidia GPUs, targeting latency-sensitive coding and voice agent use cases
Feb 24, 2026 with Stefano Ermon
Key Points
- Inception Labs' Mercury 2 achieves 1,000+ tokens per second on standard Nvidia GPUs by using diffusion-based parallel text generation instead of traditional left-to-right token prediction.
- The speed advantage is algorithmic and hardware-agnostic, scaling linearly across GPUs via AWS Bedrock and other cloud providers rather than requiring custom silicon.
- Inception targets latency-sensitive applications like coding IDEs and voice agents, where reasoning-model quality must fit real-time latency budgets, with parallel inference as its structural moat against replication.
Summary
Inception Labs announced Mercury 2, a diffusion-based reasoning model that achieves over 1,000 tokens per second on standard Nvidia GPUs including Hopper and Blackwell hardware. The model applies diffusion's parallel-generation approach to text and code instead of the left-to-right token prediction used by traditional autoregressive LLMs.
Diffusion models start with a rough output and iteratively refine it. Inception adapted this to text by allowing the neural network to modify many tokens simultaneously rather than one at a time. Co-founder Stefano Ermon describes the process as "coarse to fine generation"—beginning with a blurry guess at what the answer should be and progressively sharpening it, with the neural network learning this refinement pattern rather than following an interpretable sequence of steps.
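The coarse-to-fine loop can be sketched in a few lines. The toy below is not Inception's model: the "denoiser" here just proposes random tokens with random confidences, standing in for the neural network's parallel predictions. What it does illustrate is the schedule Ermon describes: start from a fully masked ("blurry") sequence and commit a batch of positions per step, refining many tokens simultaneously instead of one at a time.

```python
import random

MASK = "_"

def toy_denoise_step(tokens, vocab, rng):
    """Propose a (token, confidence) pair for every masked position in parallel.

    Stand-in for one neural-network forward pass; a real diffusion LM would
    produce these proposals from bidirectional context, not at random.
    """
    return {i: (rng.choice(vocab), rng.random())
            for i, t in enumerate(tokens) if t == MASK}

def coarse_to_fine_generate(length, vocab, steps, seed=0):
    """Start fully masked and commit the most confident batch each step."""
    rng = random.Random(seed)
    tokens = [MASK] * length
    per_step = max(1, length // steps)  # positions fixed ("sharpened") per step
    while MASK in tokens:
        proposals = toy_denoise_step(tokens, vocab, rng)
        # Commit the highest-confidence proposals in parallel; revisit the rest.
        best = sorted(proposals.items(), key=lambda kv: kv[1][1], reverse=True)
        for i, (tok, _conf) in best[:per_step]:
            tokens[i] = tok
    return tokens

out = coarse_to_fine_generate(16, list("abcd"), steps=4)
```

The key structural point is that each loop iteration touches many positions at once, so the number of sequential model calls is the (small, fixed) step count rather than the sequence length.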
The speed gain is algorithmic rather than hardware-dependent. Inception's diffusion approach is compute-bound rather than memory-bound, allowing it to hit the theoretical ceiling of the GPU's floating-point capacity. This differs from custom-silicon approaches like Chatjimmy's 16,000 tokens per second on proprietary hardware. Inception can scale linearly by adding more GPUs and runs on general-purpose infrastructure available through AWS Bedrock and other cloud providers.
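A back-of-the-envelope roofline calculation shows why moving from memory-bound to compute-bound generation matters. The numbers below are generic assumptions (a 7B-parameter model in bf16 and commonly cited H100-class figures), not Inception's measurements: batch-1 autoregressive decoding must stream every weight from HBM per token, while generating many tokens per forward pass amortizes those weight reads until arithmetic throughput becomes the ceiling.

```python
PARAMS = 7e9          # assumed model size (parameters)
BYTES_PER_PARAM = 2   # bf16 weights
HBM_BW = 3.35e12      # bytes/s, approximate H100 SXM HBM3 bandwidth
PEAK_FLOPS = 989e12   # FLOP/s, approximate H100 SXM dense bf16 throughput

# Memory-bound: one token per pass means reading all weights per token.
ar_tokens_per_sec = HBM_BW / (PARAMS * BYTES_PER_PARAM)

# Compute-bound: many tokens per pass; roughly 2 FLOPs per parameter per token.
parallel_tokens_per_sec = PEAK_FLOPS / (2 * PARAMS)

print(f"memory-bound ceiling:  ~{ar_tokens_per_sec:,.0f} tokens/s")
print(f"compute-bound ceiling: ~{parallel_tokens_per_sec:,.0f} tokens/s")
```

Under these assumptions the memory-bound ceiling sits in the hundreds of tokens per second while the compute-bound ceiling is orders of magnitude higher, which is the gap a parallel-generation algorithm can exploit on the same GPU.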
Target use cases are latency-sensitive applications including coding IDEs for autocomplete, edit suggestions, and refactoring; voice agents that need fast response times; and retrieval applications like query rewriting, reranking, and summarization. Mercury 2's reasoning capability addresses a specific tension in voice agent development: customers need the quality of a reasoning model but cannot tolerate slow inference. Inception's approach trades sequential token prediction for parallel refinement, delivering reasoning-level quality with latency budgets suitable for real-time interaction.
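The voice-agent tension reduces to simple latency arithmetic. The token count and the slower baseline throughput below are illustrative assumptions, not figures from the article; only the 1,000 tokens/sec rate comes from the announcement.

```python
# Rough per-turn latency for a voice agent (illustrative numbers only).
reasoning_tokens = 400  # assumed reasoning trace + spoken reply for one turn

for name, tps in [("sequential decode @ 60 tok/s", 60),
                  ("parallel decode @ 1000 tok/s", 1000)]:
    print(f"{name}: {reasoning_tokens / tps:.2f} s per turn")
```

At the assumed slower rate the turn takes several seconds, which is an awkward pause in conversation; at 1,000 tokens/sec the same reasoning trace fits comfortably inside a sub-second real-time budget.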
Ermon acknowledges distillation risk. Recent work by Chinese teams shows models can be effectively distilled from API access using surprisingly few data points. He frames this as a scientific fact of the field rather than a defensive problem for Inception, noting that once API access exists, replication becomes inevitable.
Comparing benchmark performance between diffusion and autoregressive LLMs, Inception's models perform well on coding and editing tasks, partly because diffusion can condition on context in both directions rather than only on tokens to the left. Ermon attributes some of this advantage to training data choices, as the team prioritized coding workloads. Speed itself is the defensible moat: the parallel inference pattern is structurally harder to replicate than benchmark performance alone.
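The bidirectional-context advantage can be pictured with attention masks. This is a schematic (plain boolean masks, not any particular model's implementation): under a causal mask a token being edited in the middle of a file sees only itself and what came before, while a diffusion-style bidirectional mask lets it also see the code that follows the edit.

```python
n = 6
# mask[i][j] is True when token i may attend to token j.
causal = [[j <= i for j in range(n)] for i in range(n)]  # left-to-right only
bidirectional = [[True] * n for _ in range(n)]           # both directions

i = 2  # a token being rewritten mid-sequence
visible_causal = sum(causal[i])        # itself plus tokens to its left
visible_bidi = sum(bidirectional[i])   # the entire surrounding context
print(visible_causal, visible_bidi)
```

For edit and refactoring tasks this matters directly: the correct rewrite of line `i` often depends on code below it, which a left-to-right model cannot see at that position.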