Groq CEO Jonathan Ross: LPUs beat GPUs on inference economics, new European data center live in under a month

Jul 8, 2025 · Full transcript · This transcript is auto-generated and may contain errors.

Featuring Jonathan Ross

We have Jonathan here. How's it going? It's good. How are you? Welcome. Thanks so much for stopping by. Sorry about the little mix-up with the scheduling. Glad you could make it, though. Great to meet you.

Thanks for having me. Would you mind introducing the company and the current state of the products that you sell, just for anyone that might not be familiar? Sure. So, the company is Groq, G-R-O-Q, and we sell tokens as a service.

So, very much like you use OpenAI's API or Anthropic's API, except we serve mostly open-source models. And we've built our own AI chips that allow us to do this much faster than anyone else. Yeah, I remember the initial demos.

I believe it was on an early version of Llama, and the inference rate was just so much higher. What were the key decisions that you made to enable faster inference? One of the key ones was we don't use any external memory.

Every time you read from memory, the chip just has to wait for it, and it's very slow; it's like drinking a drink through a martini straw. And so what we've uniquely been able to do is keep the cost down while also being very fast.

You can actually get speed off of a GPU, but then it just becomes completely uneconomical to use, because you have one user using it at a time.
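To make that arithmetic concrete, here's a rough sketch with made-up numbers (the hourly rate, per-user speed, and the assumption that total throughput scales linearly with batch size are all illustrative, not Groq or Nvidia figures): a GPU serving one user at full speed makes each token far more expensive than one serving a large batch.

```python
# Rough sketch: why single-user GPU speed is uneconomical. All numbers are
# invented; the model assumes total throughput scales linearly with batch
# size until the chip becomes compute-bound.
gpu_cost_per_hour = 3.0        # hypothetical rental cost, USD
tokens_per_s_per_user = 50.0   # hypothetical per-user decode speed

for batch_size in (1, 8, 64):
    tokens_per_hour = tokens_per_s_per_user * batch_size * 3600
    usd_per_m_tokens = gpu_cost_per_hour / tokens_per_hour * 1e6
    print(f"batch {batch_size:>2}: ${usd_per_m_tokens:7.2f} per 1M tokens")
```

With these numbers, batch 1 costs about 64 times more per token than batch 64, which is the trade-off being described.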

So what are the trade-offs of not having off-chip memory? I imagine, you know, there's a big discussion over how important large context windows are. Is that an important factor in how LLMs are scaling? And ultimately you're somewhat beholden to the decisions that are made by AI researchers at big labs, right? Well, interestingly, we actually very often support the largest context, or the largest context that's practically offered by anyone else.

And this is a little counterintuitive to most people. The trade-off we have is memory capacity, but GPUs have a memory bandwidth issue. So, what the researchers do to improve the memory bandwidth issues always improves the memory capacity issues for us.

So we can run all the same models with the same context length. The trade-off is actually something else. You could think of us a little more as a nuclear reactor instead of thinking of us as a diesel generator. Whereas with GPUs, you can take just eight GPUs and use them for a model,

we need a lot more of our LPUs. But when we're running that model, very much like a nuclear reactor, we get much better cost, and it's better for the environment, because in our case we use less energy. Interesting. What does the shifting landscape around custom silicon look like right now?

I feel like every hyperscaler is taking it more seriously. We've heard stories of Microsoft wanting to pull back from Nvidia chips. It seems like Amazon's kind of doing the same thing with Trainium and Inferentia. What is it?

Are we going to just be in a world where every hyperscaler has their own chip, or do you think that the market will fragment in a different way? Well, my background: I'm actually the person who came up with Google's TPU chip. It started as a 20% project. No way. Wait, it actually started as a 20% project? It did.

It was unintended. That's amazing. Yeah. So, there are 20% projects; that's a real thing. But what most people don't realize is that there were actually three different AI chip efforts at Google, and that was the only one that survived and was competitive with GPUs. The other two ended up getting cancelled.

And so when you start looking at the broader and broader world of chips being built, a very small number of them are actually going to be competitive with GPUs. So most of those are going to be canceled. But I think where people get really confused is they keep talking about the chips.

And it's not about the chips, it's about the software. There are all these little features, like prefix caching, speculative decoding, and all sorts of things like that.

If you didn't have one of those features, you wouldn't actually be able to be competitive with the latest GPU, no matter how good your chip was. Sure. So, the thing is, you have to first catch up in software, and only then do you get to compete on the chip.
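For a sense of what one of those features does, here's a minimal sketch of the idea behind prefix caching (a generic illustration, not Groq's or anyone's actual implementation; `run_model` is a hypothetical stand-in for a model forward pass): requests that share a prompt prefix reuse the attention state computed for that prefix instead of recomputing it every time.

```python
from hashlib import sha256

# prefix hash -> attention (KV) state computed once for that prefix
prefix_cache: dict[str, object] = {}

def generate(prompt: str, prefix: str, run_model):
    """run_model(text, state=None) is a hypothetical forward pass that
    returns updated model state; only the un-cached suffix is computed."""
    key = sha256(prefix.encode()).hexdigest()
    if key not in prefix_cache:
        prefix_cache[key] = run_model(prefix)      # pay the prefix cost once
    cached_state = prefix_cache[key]
    return run_model(prompt[len(prefix):], state=cached_state)
```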

And so one of the things that we did that was very unusual is we spent the first six months working on a compiler, which makes our software work very easily. So we can actually compete on the software level with Nvidia. We have the same models running, and only then do we get to compete on the chip.

So are you talking about, at Google, that was the push to break out of the CUDA ecosystem, essentially? Well, actually, this was at Groq. At Google we had these handwritten kernels, which are these sort of assembly routines, and it's the same with GPUs. Okay.

TPUs and GPUs are actually more similar to each other than to what we've built here at Groq. Yeah.

How do you think about the current AI paradigm and how long the current algorithmic regime will last, or how it will change? Because I feel like there's a narrative around companies that bet on a particular architecture, you know, okay, we're going really long on the transformer or something like that.

Then there's always this question of, like, will the transformer stick around? Well, it's been around for a long time. It seems really good. Maybe it'll be around forever, maybe it won't. How do you think about that? Is that the right question to be asking? Yeah.

I remember when we were at Google, we hired this really senior person right before we were done designing the TPU, and he came in and he said, "You cannot build a generally programmable chip that will outcompete GPUs. It's impossible.

Instead, we need to stop what we're doing and just create a chip for a particular model." Sure. And we had to walk him through it at the whiteboard, but we actually ended up proving to him: we made a faster chip by making it general.

And the counterintuitive part was: if you're able to optimize some of the circuitry in the chip to be reused for many different things, and you're only designing a couple of those circuits, you can make them much better than if you have a whole bunch of different stuff that's, you know, not as optimized.

And when you look at one of these models, you can't fit it on a single chip anyway. You have to use multiple chips. So, are you going to design a different chip for each part of the model? It starts to become ridiculous and doesn't make any sense.

There have actually been people recently who've been trying to revive that idea. But since they started, mixture of experts became a thing. So, if they had already started designing their chip, the chip would be obsolete.

And now we're seeing other sorts of things emerge that are very interesting, and if any of those win, those chips will be obsolete. So we always focus on as general a programmable chip as we can. Switching gears a little bit: I believe you're in Europe right now. It looks dark outside.

And it looks like you're... thank you for staying up. Thanks for staying up to hang with us. Talk about the decision to put the first European data center in Helsinki, and everything that went into that.

So, we were at an event here, an AI event, and about a month ago we decided it would probably be great if we unveiled a European data center while at this event. And literally a month later, we now have it up and running, actually running models.

So, I think that's a record for time from deciding to deploy to getting it up and running. Congratulations. Well, we actually weren't expecting to have it until the end of this week, which would have been insane in and of itself, but it's actually up and running now. So, kudos to the team.

The team is probably sweating, because they're like, "Okay, this is the bar now. We actually have to be faster than this." The previous bar was 51 days: when we deployed about 20,000 chips in Saudi Arabia, it was 51 days from signing to having it set up. Oh, I remember that deal. Yeah.

Now it's like a month and three days or something. So what makes that possible? Is it the energy efficiency? Is that a huge part of it? Being best friends with Morris Chang at TSMC to make our chips, maybe. Okay. Yeah. In our case, we've actually simplified the architecture pretty significantly.

We don't use any external switches as part of our core interconnect, only to move data in and out of the system. It's pretty simple cabling. Still air-cooled; we're not liquid-cooled. We like to stay a generation behind on all of the technology and just compete on the architecture and the software. Okay.

And so that allows us to use these highly proven technologies, where everyone's already debugged them and we're not debugging them as we deploy. But also, we don't use that external memory, and external memory tends to lead to a lot of chips that fail in the field.

And so as you're bringing them up, a lot of them fail and you have to fix them. There was a recent model that we ran that required 4,000 chips. The first time we ran that model on the 4,000 chips, it worked. We would never have been able to do that if it was 4,000 GPUs.

The first time you run something on 4,000 GPUs, you have to replace a bunch of them everywhere. There have been some rumors that you've operated the token business, the API, at a loss. That feels like a completely rational strategy in many ways. Nothing wrong with that.

But can you comment on that as a strategy? Is that just a misconception? Is it deliberate? How do you think about investing for adoption broadly? Because it's really important as a business to grow, and there's a whole bunch of different strategies and tools in the tool chest.

What do you think is valuable? What's been the strategy to date, and how's it evolving?

Let's get even more specific than "at a loss," because when you're running a service, you can claim you're not at a loss by assuming an infinite amortization time on your capex and your chips, and that's what a lot of the neoclouds are doing.

So we actually aim for a more aggressive payback period than anyone does with GPUs, and part of the way we can get away with this is, first of all, our electricity cost per token is about one-third that of a GPU. So our opex is already lower. Okay. And that's sort of the floor.

You have to beat your opex, otherwise you are losing money, period. There are no accounting tricks you can do. And then on top of that, the question is how quickly you can pay back. For us, everything we run is above our opex cost, so we're not losing money. But different models have different payback periods.

Some of them are as quick as two years on our V1 silicon. Some take, you know, four years. And it's a blend, right? Okay. And that overall blend is what we're happy with. Makes sense.
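As a rough illustration of that payback arithmetic (every figure below is invented; Groq doesn't publish these numbers): once the price per token clears opex per token, the remaining margin pays down the capex, and the payback period is just capex divided by daily margin.

```python
# Illustrative payback-period arithmetic; all figures are made up.
capex = 20_000.0            # hypothetical cost of a deployed unit, USD
opex_per_m_tok = 0.05       # hypothetical electricity+ops cost per 1M tokens
price_per_m_tok = 0.30      # hypothetical revenue per 1M tokens
m_tok_per_day = 100.0       # hypothetical throughput, millions of tokens/day

margin_per_day = (price_per_m_tok - opex_per_m_tok) * m_tok_per_day
payback_days = capex / margin_per_day
print(f"payback: {payback_days:.0f} days (~{payback_days / 365:.1f} years)")
```

With these made-up numbers the unit pays for itself in about 2.2 years; a different model with a worse price-to-opex spread on the same hardware would land closer to four, which is the blend being described.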

Yeah. On that note, there's one world where, you know, the models are getting better, and everyone just wants the best model for every single task, and they're just willing to pay for the latest and greatest constantly.

There's this other interesting model where I feel like we get a new capability, whether it's the ability to convert, you know, CSV to JSON, or raw text to JSON, or data extraction, or censoring bad words in a huge transcript, and the problem is completely solved by the LLM. You don't need to throw a more advanced model at it in the future, even if one comes down the pipe; you just need speed, and reliability, and eventually cost.

And so, do you see an important part of your business where you can be the baseline provider for a current capability, one that will continue for decades potentially, even as a newer model comes online? Or do you want to try to shift the business toward being constantly on the most aggressive side of the frontier? And is there actually a trade-off there, or can you just do both?

So, a very common usage pattern we see is that people will develop their systems on the latest and greatest model. Yeah. And then they will optimize for two things: one is speed, and the other is cost. Yeah.

Very often the most successful AI startups are spending almost as much per month on tokens-as-a-service as their total revenue, right? And they'll lose money on paying employees and things like that, but they're trying to keep those balanced. Yeah, we can bring the cost per token down.

Interestingly enough, the general inclination is that they want to spend the same amount; they just want more tokens, because those tokens can actually turn into more quality if you can iterate, if you can do more in parallel. The speed thing is super important, because that's engagement.

So, if you think about it, every time you increase the performance of a model, or any service, by about 100 milliseconds, your engagement rate or your conversion rate tends to improve by about 8%. That's on desktop. On mobile, it's over 30%. People have no patience on mobile whatsoever.
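Taking that rule of thumb at face value and compounding it per 100 ms saved (the compounding is my assumption; real conversion curves are not this clean, and the quoted percentages are the only figures from the conversation):

```python
# Compounding the quoted rule of thumb: ~8% conversion uplift per 100 ms
# saved on desktop, ~30% on mobile. Purely illustrative arithmetic.
def uplift(ms_saved: float, rate_per_100ms: float) -> float:
    return (1 + rate_per_100ms) ** (ms_saved / 100) - 1

for ms in (100, 300, 500):
    print(f"{ms} ms faster: desktop +{uplift(ms, 0.08):.0%}, "
          f"mobile +{uplift(ms, 0.30):.0%}")
```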

Yeah. And so what we see is people get stuff working:

they prove that it's possible on the most capable model, and then they'll try to find one of the models that we run; we run all the open-source models. And when they're able to find one that works for them, because they have to try a couple, they're very happy, and then they don't change things. And so there are a couple of models, actually quite old, quite terrible, that we would love to deprecate, but there's a bunch of users on them, because their system works and they don't want to touch it.

The other thing is, and this one's interesting:

we had one of our engineers try solving a bunch of math theorems. They were using this thing called Lean, which is a formal theorem prover, and they would use different LLMs in the background to solve, and then they would use Lean to test. Every single time, Claude Opus would actually solve it in the fewest iterations.

However, Qwen3 32B, running on our chips, because they're so fast, always solved the theorems faster. It may have taken more iterations, but it was able to go through them faster, and so you actually got the result faster with the less expensive model running on Groq.
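The workflow being described is a simple generate-and-verify loop, where total wall-clock time is roughly iterations times seconds per iteration, which is why a faster model can win even with more iterations. A minimal sketch, assuming a hypothetical `ask_llm` stand-in for any LLM API and checking each attempt by invoking the `lean` command-line binary on a file:

```python
import pathlib
import subprocess
import tempfile

def ask_llm(prompt: str) -> str:
    """Hypothetical stand-in for any LLM API call returning Lean source."""
    raise NotImplementedError

def prove(theorem: str, max_iters: int = 10) -> str | None:
    feedback = ""
    for _ in range(max_iters):
        attempt = ask_llm(f"Write a Lean proof of:\n{theorem}\n{feedback}")
        path = pathlib.Path(tempfile.mkdtemp()) / "attempt.lean"
        path.write_text(attempt)
        # Lean acts as the ground-truth checker for the LLM's attempt.
        result = subprocess.run(["lean", str(path)],
                                capture_output=True, text=True)
        if result.returncode == 0:
            return attempt                      # verified proof found
        feedback = "Lean reported:\n" + (result.stderr or result.stdout)
    return None
```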

Interesting. It's been a crazy year so far. Yeah. We're about halfway through. What are you excited about in the next six months on the Groq side, and what are you excited about in the broader open-source ecosystem? Well, on the Groq side, we're continuing to scale up. We're going to add a bunch of new data centers on new continents.

After this, I'm going somewhere else in the world. I'm not going to tell you where, but we're working on getting another data center deal done. Amazing. Congrats. Also, there are some rumors that we might have some new improved systems or chips or something coming later this year.

And there are allegations, allegations of massive advancements. Rumors of a breakthrough. A major breakthrough. Yeah, I love that. Can you give me your retrospective, but also your current take, on Moore's law broadly?

Yeah, I think Moore's law is true, but I don't think it's the most important part, and people miss this. So, Moore's law, for those who don't know: every 18 to 24 months, the number of transistors on a chip doubles. That translates to the economics getting better.

That translates to the chips getting faster, and you can trade those off.

What's really been happening in the last 5 to 10 years, and I don't know what to call this law, is this: when I left to start Groq, one of the observations was that every 18 to 24 months, not only did the transistors double, but the number of chips doubled. The transistors doubled and the number of chips doubled. So when we started, we asked ourselves the question: what would happen if we assumed there were an infinite number of chips? How would you design the system differently? And that's one of the reasons we decided to get rid of the memory altogether, because the memory just holds bits that are doing nothing, waiting for an active chip to do something with them.

We're like, let's just make everything active and do everything all at once. And so that was a very important observation for us. And I think people are going to focus more on the number of chips doubling over the next couple of years than on how much each chip improved.
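The compounding behind that observation is simple but dramatic: if transistors per chip and the number of chips each double every 18 to 24 months, total capacity in the field grows roughly 4x per period (illustrative arithmetic only; the doubling cadence is the only figure from the conversation).

```python
# If transistors/chip AND chip count each double per 18-24 month period,
# total capacity compounds at ~4x per period. Illustrative only.
transistors_per_chip, chips = 1.0, 1.0
for period in range(5):
    print(f"period {period}: total = {transistors_per_chip * chips:g}x")
    transistors_per_chip *= 2
    chips *= 2
```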

I know we'll let you go; this is my last question.

Unless Jordy has something. But can you tell us a little bit about, broadly, what international leaders in other countries are thinking, or what's driving their motivation to spin up token factories, these AI factories, new data centers, the sovereign AI efforts? Because I'm thinking back to the development of the original cloud computing revolution, and it felt like there wasn't as much of an incentive, or narrative, around "we can't possibly just let AWS come over here and set up shop." This feels different. What are the motivations? Is it economic? Is it freedom of speech? What is the shape of the conversations that you're having that get an international leader excited about setting up an AI factory, a token factory, a data center? So, it's a little bit less about fear and control, and actually more about the fact that they're not getting enough compute.

So, think of it this way. GPUs are expensive, and there's only a finite number of them being built, not because they wouldn't build more GPUs, but because they rely on that external memory, which is a real bottleneck in the supply chain.

And so Nvidia is going to build every single GPU that they can physically build this year. AMD is going to build every single GPU that they can physically build this year. And then they're going to sell every single one of them. If they could build more, they would sell more.

That means there's an allocation problem. Countries aren't getting all of the chips that they need for what they're doing internally.

So, for example, when we built that data center in Saudi Arabia, there were a bunch of people who'd been waiting over a year to get GPU orders filled, and they had no idea when it was going to happen. And so they switched over to us, because they could immediately get access to compute.

So, what you're seeing is that a lot of these sovereign plays are more about making sure they get the compute. It's as if we're entering the industrial age, and all of a sudden everyone realizes: if I plug something into an electric socket, I make all of my workers more efficient.

Well, if everyone else in the world is getting that, or it feels like that's the case, and you're not getting that, you're not getting the efficiency, you're not getting the compute to augment your workforce and make them more capable, and you're at risk of falling behind.

And so for them, it's as if all the power companies are focusing on some other really rich regions of the world, and they want to get that power as well so they can start running their AI models.

So is that more like the 5G rollout, the idea of just bringing mobile to your country, versus, I suppose, Google Search? Is that the more accurate analogy? Maybe that would be fair. That makes a lot of sense. Jordy, do you have anything else?

Any expectations around OpenAI's alleged new open-source model? Oh yeah. Is that something that you guys are going to be, you know, integrating heavily with? If OpenAI open-sources their model, we will immediately launch it. Amazing. Awesome. Well, that's great news.

Well, thank you so much for taking the time. I appreciate you staying up late. We'll talk to you soon, and congrats on the new data center, the new record. Cheers. Have a great... Bye. And let me tell you about graphite.dev. Graphite.dev, code review for the age of AI.

Graphite helps teams on GitHub ship higher-quality software, faster. The best teams in the world, including TBPN, of course, use Graphite. What else? We have some timeline we can go through. We've got to get out of here soon. Yoni Rickman says he wants to get long mold.

Did you see this? Yeah. So, I threw this in here. Request for startups: I want to get long mold. Basically, he says, "Mold is super hot right now. It's like the protein of environmental contaminants. It's in everything, with no signs of going away. It's in the air."

" Um, and this stood out to me because uh I once rented a house that ended up having mold in it. It was a huge disaster. Uh, took a lot of time, was making me sick. Uh, and I think a lot of people suffer from mold exposure or mold ill uh, related illnesses and don't realize it at all.

So, there's a bunch of different ways that you can solve this. I don't necessarily know what they all look like. I mean, one of the biggest... yeah. He says: mold testing, an Eight Sleep of mold, smart air purifiers and testing. That makes a lot of sense. Then he says mold insurance. Maybe that's genius.

I don't know, that's not what I would have thought of, but yeah. And then mold-resistant building materials. That's kind of cool. Yeah. So, there's a bunch of different categories. So, I wanted to throw this in here: if you are long mold, yourself and your business, go reach out to Yoni.

I would love to see more solutions here. In other news, tomorrow night there will be a Grok 4 release live stream: Wednesday, 8:00 p.m. Pacific. They're staying up late. I love how they always do their streams after everyone else has left the office and gone home. They're like, "We're still here.

We're going to go into our tents, you know, come out of our tents and do a stream." So I'm excited for this. I mean, this is going to say a lot, because it feels like it hasn't been long since Grok 3. So yeah, there will be a big question about where this hits.

Also, you know, it feels like we hit this pre-training wall. A lot of people are saying we need new algorithms, we need new ideas. And I'm sure they'll be able to... the question is not, can xAI build a big data center and get to the frontier.

The question now is, can they push past the frontier with some actual, completely unthought-of innovation? So that'll be interesting to track. But either way, it'll be interesting to see what they're doing on the image side, the video side. The video side would be very interesting.

I mean, X has a lot of video data because, you know, people have been uploading video to X; certainly not as much as YouTube, but a lot of YouTube videos get uploaded to X in full. Yeah. Imagine if you could tag Grok under a meme and say, animate this. Yeah. Turn this into GTA. That would be crazy.

A 20-second GTA V. That would probably be very expensive on the inference side, but, you know, maybe worth it. They have a lot of GPUs. Should we close with this Audi ad? Beautiful. Let's do it. Advantage, Audi. Beautiful ad. I couldn't tell if this was a real ad. 10 out of 10. No notes. I couldn't tell if this was actually live,

if it is tennis balls, if it's not in the shape of the Audi logo. Fantastic. Fantastic. Well, thank you so much for tuning in today. We will be back tomorrow. A little chaotic startup and assembly. They didn't want us to podcast, but they couldn't stop us. They'll never stop us.

So, we will see you tomorrow. Have a