Inception Labs' diffusion LLM hits 1,000+ tokens/sec on standard Nvidia GPUs, targeting latency-sensitive coding and voice agent use cases
Feb 24, 2026 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Stefano Ermon
And without further ado, we have Stefano from Inception Labs. He's the founder and CEO. Welcome to the show. How are you doing?
Very good. Thanks for having me.
Thanks for hopping on. First time on the show, so I'd love to have you kick it off with an introduction on yourself and the company.
Of course. Yes, I'm Stefano. I'm one of the founders and the CEO of Inception. Before this, I was at Stanford in the CS department. I've been doing research in generative AI for a long time.
I think my lab is mostly famous for having co-invented diffusion models back in 2019. I was on the FlashAttention paper, DPO. So, a bunch of things that are now widely used in production, and
these days I'm most excited about the diffusion language models. That's what we're doing at Inception.
Yes. So, I first saw a diffusion language model demoed at Google I/O, I believe. But tell us, explain it like I'm five, because when I think diffusion, I think a bunch of fuzzy noise, and then the Midjourney image gets higher and higher resolution. Everyone's familiar with that, and they're familiar with token streaming and next-token prediction. Is it different? Break it down at a very low level or high level.
That's right. Basically, we've taken diffusion models, which is the thing that works best for image and video generation, a kind of coarse-to-fine process where you iteratively refine your output until it looks good,
and we figured out a way to apply it to text and code generation.
Okay.
And it kind of works the same way. You start with a rough guess of what the answer should be and then you refine it.
Okay.
And crucially, the difference is that the neural network is able to modify many tokens at the same time. And so it's much, much more efficient than the typical autoregressive model, where you generate left to right, one token at a time. You're able to modify many tokens in parallel.
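To make that contrast concrete, here's a minimal sketch of the two decoding loops. The `toy_model` function is a hypothetical stand-in for a trained network (it just guesses randomly), so the point is purely the control flow: autoregressive decoding needs one forward pass per token, while diffusion-style decoding re-predicts every position in parallel and needs only a fixed number of refinement steps.

```python
import random

VOCAB = ["the", "quick", "brown", "fox", "jumps"]
MASK = "<mask>"

def toy_model(tokens):
    # Hypothetical stand-in for a trained network: proposes a token for
    # every position in one "forward pass". A real diffusion LM would
    # output a learned distribution per position; we guess randomly.
    return [random.choice(VOCAB) for _ in tokens]

def autoregressive_decode(length):
    # Left to right: one model call per generated token.
    out = []
    for _ in range(length):
        out.append(toy_model(out + [MASK])[-1])  # keep only the next token
    return out, length  # (tokens, number of model calls)

def diffusion_decode(length, steps=4):
    # Coarse to fine: every step re-predicts ALL positions in parallel,
    # then commits a fraction of them. Model calls = steps, independent
    # of sequence length.
    tokens = [MASK] * length
    per_step = -(-length // steps)  # ceiling division
    for _ in range(steps):
        proposal = toy_model(tokens)  # one parallel forward pass
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        for i in random.sample(masked, min(per_step, len(masked))):
            tokens[i] = proposal[i]  # many tokens committed at once
    return tokens, steps

if __name__ == "__main__":
    _, ar_calls = autoregressive_decode(32)
    _, diff_calls = diffusion_decode(32, steps=4)
    print(f"autoregressive: {ar_calls} model calls; diffusion: {diff_calls}")
```

For a 32-token output, the autoregressive loop makes 32 model calls while the diffusion loop makes 4; that gap, multiplied across long generations, is where the latency win described here comes from.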
So if I'm thinking of, say, a deep-research-report-type response, I can imagine a report in my mind. Say, "explain the history of the Roman Empire," that's the example I always use. It's going to have some structure to it, and I'm going to imagine a blurry image with a couple of large headers, and then the headers get filled in, then the text gets filled in. Maybe there are some bullet points, maybe some dates, maybe some charts, and all of this comes together. But I'm thinking about it not sequentially but as a whole, and then refining iteratively. And instead of pixels, I'm thinking of individual characters. Or are there tokens in the same way that might exist in an LLM? What does that look like?
Yeah, that's the right intuition. So it's kind of like coarse-to-fine generation, and in practice, you know, it's learned by a neural network, so it's not necessarily interpretable. It's not the kind of process I would go through, where maybe I start with section headings and then fill in the details. It's all learned by a neural network, and so it's not really interpretable. But it's fast. That's really the...
So is speed the main thing? I mean, we had the founder of chatjimmy.ai on the show, Talis, and it seemed like he was able to bake a traditional LLM, Llama 3 8B, down onto silicon, and it was spitting out 16,000 tokens per second. Do you have a comp on speed or cost that you're targeting? Or do you see a through line, like, okay, maybe if we're running on Nvidia chips and he's running on custom silicon, he's going to be faster, but then once we get to custom silicon, we're going to be 10 times faster than that? How should I be thinking about the trade-offs here?
Yeah, so our benefit is purely at the algorithmic level. It's just a more parallel approach that is not memory-bound; it's FLOPs-bound, right, it's compute-bound. So you're able to hit the ceiling of the roofline, and we're taking advantage of all the resources we can get access to on the GPU.
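A back-of-the-envelope roofline sketch shows why parallel refinement helps: attainable throughput is the minimum of peak compute and memory bandwidth times arithmetic intensity, and producing many tokens per weight read pushes the workload toward the compute-bound side. The GPU figures below are rough, illustrative H100-class numbers, and the intensity formula is a crude assumption, not Inception's actual kernel math.

```python
def attainable_tflops(flops_per_byte, peak_tflops, bandwidth_tb_s):
    # Roofline model: you are capped either by raw compute or by how fast
    # you can feed the chip from memory, whichever bound is lower.
    return min(peak_tflops, bandwidth_tb_s * flops_per_byte)

# Illustrative H100-class figures (approximate, for intuition only).
PEAK_TFLOPS = 990   # ~dense FP16 tensor-core peak
BW_TB_S = 3.35      # ~HBM3 bandwidth

# Batch-1 autoregressive decoding streams the full weight matrix to emit
# one token: roughly 2 FLOPs per weight byte, deep in memory-bound land.
# Refining N tokens in parallel multiplies the work done per byte moved.
for tokens_in_parallel in (1, 8, 64, 256):
    intensity = 2 * tokens_in_parallel  # crude FLOPs/byte estimate
    tflops = attainable_tflops(intensity, PEAK_TFLOPS, BW_TB_S)
    print(f"{tokens_in_parallel:4d} tokens in flight -> {tflops:7.1f} attainable TFLOPs")
```

With these assumed numbers, one token in flight leaves the GPU at a few TFLOPs (memory-bound), while a few hundred tokens in flight hits the compute ceiling, which is the sense in which parallel token refinement "hits the roofline."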
In practice, what this means is that we can get to over 1,000 tokens per second.
Wow.
On traditional Nvidia GPUs: Hopper, Blackwell.
Yep.
So we're not yet at the level of the 16,000 tokens that you can get if you actually implement the model in hardware, but we're running on general-purpose GPUs. So we can scale up as much as we want; it's just a matter of getting more GPUs, and you can run these models anywhere. We're on Bedrock. We're on Foundry. So if you have your own GPUs, you can provision your own capacity and run our models there.
So it's very, very scalable. It's fast and scalable, and in principle, yeah, it can be compounded. You have a 10x benefit from the software, you have a 10x benefit from the hardware; those two things could be combined.
So, what use cases? You know, there's a lot of people out there using traditional language models today. What are the kinds of use cases where you would tell somebody they should be switching over today, or at least trying to start experimenting?
Yeah, we're seeing a lot of traction in latency-sensitive applications of LLMs, like whenever there is a tight loop where you need to interact with a developer or a customer. So our models are being deployed in a bunch of IDEs. If you think about coding: autocomplete, next-edit suggestions, refactoring, quick agentic loops. That's a very natural kind of application where diffusion models are already really, really good. Voice agents: we have a number of partners and customers that are building really, really good voice agents. The latest model we announced today, Mercury 2, is a reasoning model, but it's really, really fast, so you can get the quality of a reasoning model within the latency budget you need to build a voice agent, which is resonating really well with a bunch of early customers. Retrieval and search, that's another space where we're seeing a bunch of applications built on diffusion. If you think about query rewriting, reranking, summarization, that's another really good use case for diffusion LLMs.
Talk to us about Distillgate and how you've been processing it. Have you worked on that? Have you...
It's a sign of success.
At any point in your career, were you experimenting with this stuff? Is this something where we kind of forced the Chinese market into spending a lot of resources on it? I mean, it makes sense, right? That's what's always going to happen. I think the moment you put it out there, you know, you give API access to the world, that's going to happen and people are going to copy you.
I mean, we've been doing distillation in the research community for a long time, and so people have been experimenting and figuring out ways to do it in a sample-efficient way. So I'm not surprised that it's happening. It's hard to know at what scale, and honestly, from the numbers that were circulating, it seems like they're able to do it with very, very few data points. That was the most surprising thing to me. So, you know, it's very interesting scientifically that you can actually distill with so few data points, because it means it's going to be very, very hard to protect any IP.
Yeah.
If you're opening the model up from an API point of view. So, the last question, somewhat related to that: I feel like when these models get distilled, we see very strong benchmark performance, and then some yet-to-be-quantified-and-benchmarked quality degrades, and you hear people who actually try to put them into production saying, ah, it just doesn't have the same big-model flavor I'm getting from the big labs. I don't know how real that is, but I'm wondering, if you zoom out and look at diffusion versus transformer-based LLMs, are you noticing any divergence in the benchmarks, where you're maybe better at coding or less good at coding, where the mental model we're giving the computer is leading to surprising results?
Yeah. So, what we're seeing is that it's good at coding. It's good at editing. One nice thing about not necessarily being left-to-right is that you can use context all around you. So those use cases have emerged as being really, really good for diffusion LLMs. I think it's also a function of the training data we use. You know, we always liked coding. We're all computer scientists, and so that was a very natural application area for us. So I don't know how much of that depends on the training data we used versus the model. But what's exciting is really just the speed. That's the thing that
Yeah.
it's going to be hard to replicate.
I've got a need for speed. I'm super bullish on speed. I'm serious. I think it's amazing. I used 5.3 Spark on Cerebras and I was like, this is the future. It's going to come to everything, and it's going to be an important moment for people to realize that it's just a different product when you're interacting with something fast. And I think we learned this from Amazon squeezing out milliseconds in web page loads, and we're going to experience it in AI, too. So, thank you for everything you're doing to speed up AI. We loved having you on the show. Have a great rest of your day.
Yeah, great to meet you.
We'll talk to you soon.
Goodbye.