Baseten raises $303M Series D at $5B valuation as enterprise AI inference hits an inflection point
Jan 23, 2026 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Tuhin Srivastava
How are you doing?
Hey, how are you guys?
Good to see you.
Welcome back.
Welcome back.
Thanks for having me again.
You've been busy.
You've been busy. Give us the news.
Yeah, look, we've had a crazy year. Um, you know, we're 10x over last year. [laughter]
Um,
10x alert. Hit the gong for the 10x.
Sorry to interrupt.
All right, back back to you.
Back to business. Last time I was on here, I was telling you about our series D. We've raised our series D now.
We've raised $300 million at a $5 billion valuation,
uh, led by IVP and CapitalG, with participation from Nvidia and a few others.
Um, look, it's just been... Just to remind you guys what we do.
Yeah, please. Um, we're an AI infrastructure platform. We focus on production-grade inference, serving the fastest growing companies in the world. So Abridge, OpenEvidence, Cursor, Notion, Clay, Gamma, Writer. You look at that crop of companies and how they're growing, and, you know, we're playing a supporting role in making sure that they can run their models
as fast and reliably as possible. Um, and just on the back of that, um, you know, we realized how big this could potentially be.
Yeah. Yeah. Go ahead.
Take it. Yeah. Take us through how quickly enterprises are ramping up running their own models. Like, if we go back to 2022, uh, Google had PaLM internally. They hadn't really launched Gemini. OpenAI had GPT-3 and then they launched ChatGPT. Uh, but Microsoft, you know, they were vending GPT-4 through, uh, through Bing and stuff, but big companies were not really running their own large language models. Is that roughly correct? But now it feels like we're in the hundreds of companies. But out of the Fortune 500, like, how many companies are actually running their own models at this point in time?
Yeah, look, I think it's still relatively small, but growing rapidly. I think, you know, when we go and speak to enterprise customers, it's still incredibly early. But what they are doing is looking at, you know, these other companies that, you know, you and I would probably agree are the future Fortune 500, kind of eating their lunch, and if they don't figure out how to move at that pace, I think they'll just be left behind. And so we focus really, really heavily on, you know, those really fast growing companies, because we think not only are they going to be the future Fortune 500, but the Fortune 500 are going to look at them as, like, how do we bring these models up to speed as quickly as possible.
Yeah. And so, I mean, that makes perfect sense in the case of, uh, a company like Cursor. Obviously they have a ton of training data. They have a need for a custom model. Uh, are we years away from someone like a Coca-Cola, you know, developing their own, or how does that play out? Yeah,
I think, look, there's two things that are happening, right, which is,
like, if you go look at OpenAI and Anthropic adoption at these companies, it is definitely there. Um, once you've proven out these use cases internally, then you just have to think about better, faster, more specialized, more secure, running on-prem, all these sets of problems. And so, you know, that is coming. Is it six months away? Is it 12 months away? Is it 18 months away? Who knows. Um, I think the other trend
that is pushing, um, the enterprises online is RL, which is, like, how do you take these open-source base models, um, and RL them and get them to be as good if not better than, um, these frontier models for very specific tasks? And, you know, that's a big focus for us as well. We recently made an acquisition around that. Um, but we do think that'll be a big driver of enterprise adoption of these
models. Walk me through the pushback from a large enterprise, uh, that says, yeah, it's great, I can train a custom model on my business, but I looked at the smaller models for a variety of benchmarks, and it feels like the big models just sort of are better at the small models' tasks as well. Like, I should just use the latest and greatest big model for everything, because the frontier is better at the specialized tasks as well. It just one-shots things.
Yeah, totally. Well, I think, um, honestly, a lot of the time they'll be valid there. You'd be like, "Yeah, I don't disagree with you." I think what you will see, though, is that as the frontier is getting better, open source is already
um, you know, if you go look at the open source models that are out there today, like the GLMs, the Qwens, and, um, the DeepSeeks, it's not like they are 18 months behind. Yeah.
You know, some would argue that they're, you know, within a quarter and for certain tasks are better. The complexity actually comes around honestly from enterprises. Not so much like can I get a better outcome. Yeah. It's hey
you know these large um lab companies have massive inference teams.
Sure.
Um, do I have the skill set to be able to run the service at scale? Yeah. Within my premises, with all my enterprise requirements. Got it.
That's ideally where we can come in and be that partner to them to bring them online.
Okay. So, walk me through, uh, what it looks like when a Fortune 500 company says, "Okay, we're getting off of just a generic model. We want to do something where we have control over the inference. We're working with Baseten. We're doing, uh, you know, we're doing some custom RL. There's going to be some proprietary data in there. Who's involved? Are they hiring you or a consultant to set up the RL environment? Who's doing the training?" Walk through all that.
Yeah, I think just engineers and infrastructure engineers. I think that is one of the realities right now: engineers who don't know how to grapple with AI, you know, I don't know if they exist in the future, to be honest. Um, but that being said, for a lot of engineers it's like, hey, how can we make it as easy as possible for you to take this custom model, deploy it, scale it up, either, you know, in our cloud or within your own cloud? And so that is, you know, the job of the software. We have a really amazing forward-deployed team that is very happy to help, um, if necessary, but our goal is also just, you know, to enable them to do these things. And that is happening. I think it's early, though, especially in the enterprise.
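To make that concrete, here is a minimal, hypothetical sketch of what "take this custom model, deploy it, and scale it up" often reduces to in practice: wrapping a fine-tuned model in a small HTTP serving layer that a platform or your own cluster can autoscale. This is a generic illustration, not Baseten's actual interface; the model name and endpoint are placeholders.

```python
# Hypothetical sketch: a minimal HTTP serving layer around a fine-tuned model.
# Not Baseten's actual API; the model path and route are placeholders.
import torch
from fastapi import FastAPI
from pydantic import BaseModel
from transformers import AutoModelForCausalLM, AutoTokenizer

app = FastAPI()

MODEL_PATH = "your-org/your-fine-tuned-model"  # placeholder
tokenizer = AutoTokenizer.from_pretrained(MODEL_PATH)
model = AutoModelForCausalLM.from_pretrained(
    MODEL_PATH, torch_dtype=torch.bfloat16, device_map="auto"
)

class CompletionRequest(BaseModel):
    prompt: str
    max_new_tokens: int = 256

@app.post("/predict")
def predict(req: CompletionRequest):
    # Tokenize, generate, and decode; an inference platform would autoscale
    # replicas of this process behind a load balancer.
    inputs = tokenizer(req.prompt, return_tensors="pt").to(model.device)
    with torch.no_grad():
        output = model.generate(**inputs, max_new_tokens=req.max_new_tokens)
    return {"completion": tokenizer.decode(output[0], skip_special_tokens=True)}
```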
From, uh, from your view, do you think we need more neolabs, like, based on the conversations that you have with customers?
Okay, look, more models are great. I think more people developing models are amazing. Look, I'm making a bet, uh, with my career, with this company, on the long tail of models, and that there are models that exist outside of, you know, two amazing companies. Um, and so I think more people developing models just gives consumers and developers more choices down the line. And, like, look, do we need more of the same? Probably not. But, like, you know, all these companies have their own takes, um, on the problem. Uh, help me understand your current thinking around model routers, routing inference across, uh, big heavy-duty models that might be very expensive, like using all parts of the Pareto frontier. Uh, what are you seeing on the inside? Are companies trying to handle parceling out the workloads themselves, or do they want that to be something that's off the shelf that you sort of provide? What are some of the trade-offs that people are considering these days?
Yeah, look, I think it's a spectrum. I think, you know, the most advanced companies are 100% doing that. They are breaking up
their tasks, uh, almost like model by model, and then figuring out how to route.
I think less sophisticated folks, um, or people earlier in their journey, yeah, probably a better way to put it, um, are relying on other people to do that. And, you know, the model companies are doing this themselves. I think, um, the place where I struggle a bit is, like, you know, some of these routing things are just, you know, routing for failover, which is, hey, the model just stopped working.
Sure. Sure. Sure.
But we believe that reliability should be table stakes. Yeah. For serving models, and, you know, ideally you don't have to build across what happens when X model provider falls over or Y model provider falls over, and that is a different routing problem. Routing based on capability is happening. It is happening outside the model. It's happening inside the model. And as the long tail of models comes online for specific tasks, I think it's more and more going to be the case that that is being handled by an inference company or inference platform like Baseten as well.
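To illustrate the distinction being drawn here, the hypothetical sketch below separates capability routing (pick a model per task) from failover routing (fall back when a provider is down). The model names and the call_model client are placeholders, not any real provider's API.

```python
# Hypothetical sketch of the two routing concerns discussed above:
# 1) capability routing: choose a model based on the task,
# 2) failover routing: retry on a backup if the primary provider is down.
from typing import Callable

# Capability routing: map task types to a preference-ordered list of models.
# All model names are placeholders.
ROUTES = {
    "code_edit": ["specialized-code-model", "frontier-model-large"],
    "summarize": ["open-source-small", "frontier-model-large"],
    "default":   ["frontier-model-large"],
}

def route_and_call(task_type: str, prompt: str,
                   call_model: Callable[[str, str], str]) -> str:
    """Try each candidate model for the task in order; fail over on error."""
    candidates = ROUTES.get(task_type, ROUTES["default"])
    last_error = None
    for model_name in candidates:
        try:
            return call_model(model_name, prompt)  # placeholder client call
        except Exception as err:                   # provider down, timeout, etc.
            last_error = err
    raise RuntimeError(f"All models failed for task '{task_type}'") from last_error
```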
Yeah. I mean, that's certainly happened in the database market. Like, you have a big Postgres installation, you don't have a MySQL installation there as, like, a backup, but you might have a Redis caching layer in front of it, right. Yeah. Um, so, uh, help me understand that at the hardware layer. Uh, I know you have some insight into Nvidia's strategy here. I'm very interested in, uh, how you or other partners, other, uh, folks in inference might be thinking about the future of Nvidia's, like, big powerful racks versus their more legacy chips that might be depreciating, but you could still run a great model on them, versus some of the more exciting stuff that's happening with Groq in the future.
Yeah. Yeah. Yeah. Look, I don't have a ton of insight into their strategy. They're very good. I could hypothesize. Okay. You know, look, they're amazing partners to us. I think we are chip agnostic, and we think, look, every, um, task will have different requirements from a latency perspective, from a cost perspective. Yeah.
Even what type of model runs on it. And so, yeah, so, like, we use A100s, but we also use H200s and GB200s, and as
we get these new types of chips, or Groq's, we'll be working on them as well. Like, if you think about the Nvidia graph stuff, you know, what are they solving for? It's, you know, kind of breaking out pre-fill and decode. You're using GPUs for pre-fill, you know, the compute-bound problems. You can saturate the GPU. You can do batching and have really good throughput. But then you have the LPUs, or like the Groq chips, for the decode, which is memory-bound. You're not doing a lot of new math per token. Why this is hard is that that's a pretty complex, um, orchestration problem of handling, um, workloads that are doing stuff on GPUs and on LPUs. I think Nvidia's, you know, obviously built amazing software around this, um, to break out pre-fill
and decode. Dynamo is the name of the software that, you know, we work pretty heavily with. But, like, I think that would just be a new type of chip that the chip providers provide.
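For readers unfamiliar with disaggregated serving, here is a toy sketch of the pre-fill/decode split being described: one pool of workers batches full prompts (compute-bound), then hands the resulting KV cache to a second pool that generates tokens one at a time (memory-bandwidth-bound). This is a conceptual illustration only, not Dynamo's API; every function here is a placeholder.

```python
# Toy illustration of disaggregated pre-fill/decode serving (not Dynamo's API).
# Pre-fill workers batch prompts and saturate a compute-bound accelerator;
# decode workers stream tokens, which is dominated by memory bandwidth.
# The hard part in real systems is orchestrating the KV-cache handoff.
from dataclasses import dataclass
from queue import Queue

@dataclass
class PrefillResult:
    request_id: str
    kv_cache: object  # in real systems this lives in GPU/HBM memory

prefill_queue: Queue = Queue()  # holds (request_id, prompt) tuples
decode_queue: Queue = Queue()   # holds PrefillResult handoffs

def prefill_worker(run_prefill):
    """Batch prompts and run one compute-heavy forward pass per request."""
    while True:
        batch = [prefill_queue.get()]                 # naive batching for the sketch
        while not prefill_queue.empty() and len(batch) < 32:
            batch.append(prefill_queue.get())
        for request_id, prompt in batch:
            kv = run_prefill(prompt)                  # placeholder model call
            decode_queue.put(PrefillResult(request_id, kv))

def decode_worker(run_decode_step, emit_token):
    """Generate tokens one at a time: little new math, lots of memory traffic."""
    while True:
        result = decode_queue.get()
        done = False
        while not done:
            token, done = run_decode_step(result.kv_cache)  # placeholder call
            emit_token(result.request_id, token)
```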
And, you know, I think, um, one thing you can't overstate is how powerful Nvidia's supply chain is, how powerful CUDA is, and their ability to,
um, you know, be dynamic with architectures changing. And I think, yeah,
um, you know, we have very much bought into that ecosystem and what that enables, especially for inference and customers. Can you give us your take or any insight on how companies at the application layer are kind of wrestling with, uh, the, like, laziness specifically? Like, everyone at this point has experienced, you know, wanting a product to be able to do something that you know it can do, and then, uh, running into this laziness element. What, uh,
What do you mean by laziness? I don't quite get that. Uh, I would say, like, you know, thinking about, uh, let's pick the, you know, biggest, uh, not to pick on anyone, but, like, the biggest consumer AI app, ChatGPT, right? Like, everyone has experienced asking ChatGPT to do something that you know it can do,
and, like, sometimes it will just kind of circle around, or, like, not really get to the thing that you want it to do,
um, even if you're paying for it on, like, the max plan or something like that. And so I would assume that every company at the application layer
that gets to the point where they're caring about margins
is, like, starting to get to the point where they're like, okay, because sometimes the model can be lazy and you're like, great, like,
yeah, you got it to me quick,
you got me what I needed and I can just move on. Yet other times it's wildly frustrating, because it's like, you know, asking an employee to do something at five o'clock and they're like, "Oh, like, you know..." [laughter]
I mean, isn't that, like, to me, that's what defines a great application layer company, right? Like, the
Yeah. The ones that aren't good are just wrapping around, you know, GPT or whatever model, and, you know, you as a user are left to deal with that laziness. I think the best application layer companies, you know,
it's like serving the right amount of intelligence for the
task, and setting up the harness and how to use it for the problem that your user is asking you to solve for them, or you are trying to solve. Like, you know, if a company like OpenEvidence was just, you know, doing something the lazy way, that would not only be really, really expensive, but they wouldn't do a really good job of solving that problem for their users, and it wouldn't be a great business in the long term. Like, you know, what makes these companies so good is how they use the models, how they use multiple models, and how they architect that to, you know, the thing the user is asking them, and then
layering in, how do we do this efficiently so we can make a business out of it? And, like, you know, to me there is going to be a great divide in solving that laziness problem. I think that's a good way to put it, actually.
You said two things that I'm sort of having trouble reconciling. You said that, uh, you're chip agnostic, but then you also praised the dominance of the CUDA ecosystem. Uh, is it getting easier to implement LLMs in a way that runs on multiple chipsets? We've obviously seen, uh, really strong performance from DeepMind on the TPU stack, and Anthropic's talking about TPUs now. Um, but for a long time the narrative was like, oh, it's going to be really cumbersome to re-platform off of the CUDA ecosystem. How are you thinking about, uh, yeah, just multi-chip architectures these days?
Yeah, I I mean look I I think it is getting easier
but it's not it is not easy you know. [laughter]
Welcome to entrepreneurship
But, you know, like, that's why Nvidia is such a great partner for us. It's like sitting on top of them doing the stuff, you know. I mean, did you guys play FIFA or Madden or anything like that when you were growing up? Like,
I was more of Counterstrike, Call of Duty.
Sure. Starcraft.
Yeah.
It's pretty amazing that when we are doing stuff for GPUs today, we're downloading from a website that looks exactly the same as the Nvidia driver page.
Oh yeah. [laughter]
From 2000. But that is also the amazingness of it, of, like, how rich that is and how powerful it is. And I think that is,
you know, yes, it is getting easier, and, like, diversification is great everywhere downstream, but also, like, Nvidia is just amazing at what they do. And...
And being able to run a model on AMD does not necessarily mean that it will inference at a lower total cost per token, or, vice versa, it might be a little higher on certain models, blah blah blah. Uh, and so, yeah, there's all these different trade-offs, but that's why companies come to Baseten, correct?
Totally. And that's why the open source ecosystem is so important, right? That's why, you know, like,
you know, you mentioned databases earlier. Like, I think, similar to databases,
in the fullness of time, open source has the fastest runtimes,
and so I think you will see that with Nvidia, which is, like, you know who's going to be really good at running stuff on Nvidia chips? It's going to be Nvidia. Who's going to be really good at running stuff on AMD chips? It's going to be AMD. And so working with these providers of chips, I think, to get the best runtimes from them
um, is very important. And I think, like, this cross-compilation stuff, while important, you know, I'm a little skeptical, I'd say. It's just, like, you know, it seems fanciful to me to think that we would be better at running, um, software on Nvidia chips than Nvidia, or better,
yeah, at the lowest level. Obviously there's all the orchestration software stuff that we are building that we think is very important, but we're also very, very invested in those ecosystems.
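As a back-of-envelope illustration of the "total cost per token" trade-off mentioned earlier, the sketch below divides an accelerator's hourly price by its hourly token throughput. All numbers are made up for illustration; real comparisons depend on the specific model, batch sizes, and workload shape.

```python
# Back-of-envelope sketch of the cost-per-token trade-off across chips.
# All figures below are invented placeholders, not measured numbers for any hardware.
def cost_per_million_tokens(hourly_usd: float, tokens_per_second: float) -> float:
    """Hourly instance price divided by hourly token throughput, per 1M tokens."""
    tokens_per_hour = tokens_per_second * 3600
    return hourly_usd / tokens_per_hour * 1_000_000

# A cheaper accelerator only wins if its throughput on *your* model
# doesn't drop by more than its price does.
chip_a = cost_per_million_tokens(hourly_usd=4.00, tokens_per_second=2500)  # ~$0.44/M
chip_b = cost_per_million_tokens(hourly_usd=2.50, tokens_per_second=1200)  # ~$0.58/M
print(f"chip A: ${chip_a:.2f}/M tokens, chip B: ${chip_b:.2f}/M tokens")
```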
Well, I believe in you, I think you can do it, but no, I take your point, of course. Uh, well, congratulations on all the progress. Thank you so much for taking the time to hop on the stream, and, uh, have a great rest of your day.
sure you'll be back on
any day now [laughter]
Hopefully not. Um, thanks for having me, and, um, appreciate you guys.
We'll talk to you soon. Goodbye.