LiveKit raises $100M at $1B valuation as voice AI and robotics infrastructure demand explodes

Jan 23, 2026 · Full transcript · This transcript is auto-generated and may contain errors.

Featuring Russ D'Sa

will tell you more about this property after our next guest joins, because we have Russ from LiveKit

in the TBPN Ultradome. [music] Welcome to the show. How are you doing, Russ? Yo, Jordan. How you doing? It's great to be back. Good to see you guys.

Thanks so much. First, kick us off. Give us the news. What happened,

dude? The news. LiveKit is a unicorn.

Woo.

Whoa. A $100 million Series C at a $1 billion valuation led by Index Ventures, Salesforce is in, Altimeter's in, Redpoint's in, and you're now a billion-dollar company. How does it feel?

Feels amazing.

Is the job finished? Is the job finished?

Yeah. Overnight, true overnight.

True overnight success.

No, definitely not. It's day zero. But yeah, it took 20 years. Pretty incredible.

What were the key growth unlocks, the key KPIs? What was the first slide in the deck that got the deal done? Is there a break in the graph, or has it just been continual growth and you're ready to take the next step?

Yeah, I think voice AI has really just started to grow and explode, and you know, I sometimes refer to us as the accidental AI company because

we never meant to be an AI company. We were video conferencing and live streaming infrastructure, and then we started to work with OpenAI on ChatGPT voice mode, and everything changed at that moment.

yeah.

What do you see as the key user experience and UI features of a great voice mode experience? I deliberately daily-drive all of the major LLM platforms now, and I'm starting to notice little subtleties: does this one have the ability to go back 15 seconds, go forward 15 seconds, make it feel like a podcast player, a more familiar UI? What do you think is most important for a potential customer implementing a voice experience? How much should they make it feel like a podcast versus an avatar that they're interfacing with? What's your feeling on that?

I think it depends on the use case. If you were to separate it into two broad buckets, there are the personal assistants that you talk to as a friend or a companion, and for those kinds of use cases, feeling very human-like matters. Latency, I think, is an important factor across both B2B and B2C use cases, but especially in the B2C use cases, you expect that an assistant you talk to is going to feel like talking to a human being and have a level of empathy that is not always necessary in a B2B use case. For a B2B use case, I'll give you an example: if you're trying to do patient intake at a hospital using an AI

call, on the phone, really the job to be done there is to get on my doctor's calendar, right? Do you need it to sound like a human or respond with the absolute lowest latency possible? No, you need it to be reliable. So you need to make sure that it's going to actually qualify the user calling in, make sure they have insurance, figure out what their affliction is, and then get them onto the doctor's calendar 99.999% of the time, or 100% of the time. And so the [clears throat] reliability of the voice agent is most important for the B2B use cases, and then the empathy and realism is more critical for these B2C use cases.
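As an illustration of that reliability-first framing, the sketch below shows one common way a patient-intake agent keeps the deterministic parts, required fields, insurance verification, and booking, outside the language model. It is a minimal, hypothetical Python example; `verify_insurance` and `book_appointment` are placeholder callables, not LiveKit's or any vendor's API.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional

@dataclass
class IntakeState:
    """Facts the agent must confirm before it is allowed to book anything."""
    patient_name: Optional[str] = None
    insurance_member_id: Optional[str] = None
    reason_for_visit: Optional[str] = None
    insurance_verified: bool = False

    def missing_fields(self) -> List[str]:
        fields = [
            ("patient name", self.patient_name),
            ("insurance member ID", self.insurance_member_id),
            ("reason for visit", self.reason_for_visit),
        ]
        return [label for label, value in fields if not value]

def next_intake_action(state: IntakeState,
                       verify_insurance: Callable[[str], bool],
                       book_appointment: Callable[[str, str], str]) -> str:
    """Decide the agent's next step; booking only happens after every check passes.

    The voice model collects the fields conversationally, but these deterministic
    gates are what make the agent behave the same way on every call.
    """
    missing = state.missing_fields()
    if missing:
        return "Ask the caller for: " + ", ".join(missing)
    if not state.insurance_verified:
        state.insurance_verified = verify_insurance(state.insurance_member_id)
        if not state.insurance_verified:
            return "Insurance could not be verified; hand off to a human."
    slot = book_appointment(state.patient_name, state.reason_for_visit)
    return f"Booked: {slot}"
```

The point of the sketch is that realism and latency can be traded away here, but the gating logic cannot, which is what "reliable 99.999% of the time" ends up meaning in practice.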

What are your predictions or keys to success with the new Siri? There's been a bunch of news that Apple did a deal with Google, so they've got a great model powering it, but there's a lot of work that they need to do on the actual voice experience side. What would your best practices be for Siri v2?

For Siri v2, the two most important things: I think the realism aspect is going to be there. I mean, if they're using a model from Google like Gemini Live, their voice-to-voice model is quite good; it feels quite realistic. Maybe one area where it could use a bit more empathy in some places, I think, and they just acquired a company called Hume in the space, which has a pretty strong, kind of

yeah that's right

Yeah, yeah, the licensing deal. So that model has a lot of emotion and can do sentiment analysis in real time, and I think that'll be an important unlock for Apple as they're using the Google model for this new Siri. I think the other thing is just having access to the right knowledge and reliability. The thing that just kills you with the original version of Siri is

it's not reliable. It can't do half the things, and sometimes it does a web search when you actually just want an answer. So I think the reliability of it, and making sure it has access to the right tools and can invoke those tools at the right time, that's going to be another critical thing. But, you know, Apple's

pretty incredible when they focus and try to nail an experience. So I think they'll get it done.

One interesting thing about Siri is that there's a woman who voiced Siri, a singular voice, and I think everyone, no matter how much they use Siri, is familiar with that identity when they hear the voice. How do you think about companies that offer one singular voice and build a brand around it, so it's really an embodiment of this particular AI system, with a name and a personality and one voice, versus giving the consumer the option to have multiple voices, or even steer the voice over time with real-time sentiment analysis?

Yeah, I think it depends again on the use case. I'm glad you brought this up. For Apple, for Siri 2.0, right, we're talking about

a digital assistant, a voice assistant, that is running at the scale of the entire world, right? Apple devices are everywhere.

And so when you think about delivering a great experience at that level of scale,

you have to think about how you meet the user where they are. To speak on voices in particular, it can't just be one voice. It has to speak different languages well, right? That's one thing. Culturally, you have to think about accents and some of the paralinguistic cues, you know, if I say "um," or the way that I speak

also varies; the way someone speaks varies across cultures as well as across languages. There are different customs. And so if you're trying to build a voice assistant that can meet the needs, or feel like the right experience, at the scale of the entire world, you have to go really deep on all of these different aspects: conversational dynamics, what are the right voices, do they have the right accents, do they speak the right languages reliably. All of that stuff matters.

And so that's something that, at Apple's scale and scope, they have to solve. For something that's a bit more contained, like a hospital in a particular part of the US, you may not have the same kind of constraints or requirements.

What's your opportunity in robotics?

It's the next big wave. I think robotics has this 80% overlap with voice AI, in that you can think of a humanoid robot, which is what a lot of the robots getting built now by companies are. You're not going to interact with that robot with a keyboard. You're going to talk to it, and it's going to have this additional capability where it has eyes and can see you and move in response to what you do and your actions. So you're going to talk to that thing, and that's the overlap, but then there's a lot of other new stuff that you have to do in the robotics use case, because it has vision and because it has limited connectivity. That robot may be out there in the field. It may not always be able to connect to the network. You may need to be able to do things locally, or on a local network, if connectivity is compromised. So there's a lot of opportunity to build in the robotics space. It's still a bit earlier than voice AI is now in terms of how it's scaling up and being adopted, but it's a wave behind voice AI that I think is going to be even bigger than voice AI.
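A concrete way to picture the limited-connectivity point: a robot typically routes each request to the large cloud model when the network looks healthy and falls back to a smaller on-device or local-network model when it doesn't. The sketch below is a hypothetical Python illustration of that routing decision; `cloud_model` and `local_model` are placeholder callables, not part of LiveKit's API.

```python
import socket

def network_is_healthy(host: str = "8.8.8.8", port: int = 53,
                       timeout_s: float = 0.25) -> bool:
    """Cheap reachability probe; a real robot would also track RTT and packet loss."""
    try:
        with socket.create_connection((host, port), timeout=timeout_s):
            return True
    except OSError:
        return False

def route_inference(audio_chunk: bytes, cloud_model, local_model):
    """Prefer the large cloud model, but never block on a dead link.

    cloud_model and local_model are hypothetical callables that take audio and
    return a response; the local one is assumed smaller but always available.
    """
    if network_is_healthy():
        try:
            return cloud_model(audio_chunk)
        except (OSError, TimeoutError):
            pass  # connectivity dropped mid-call; fall through to the on-robot model
    return local_model(audio_chunk)
```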

Are you tracking benchmarks around latency? I'm interested to know what you think about the progress to reduce latency in voice interfaces, and particularly what the bottlenecks are. Do we need to just distill models down further? Do we need custom silicon? Are there going to be dedicated chips, ASICs, for inferencing voice models? I'm not sure if you've seen those Instagram reels where people do the human impression of ChatGPT voice mode and they sort of pause, and it's funny. It's a very

talk about the guy who's like, "I'm lying in front of the train tracks." So yeah, you've probably seen these, and you're seeing your product in action. Honestly, it's a great ad for voice mode,

but it feels very much like we're almost in the dial-up internet era of these voice interfaces, where there's this little delay, and that's clearly going away. But I'm interested to know, technically, what needs to happen to remove the delay from voice interfaces.

Yeah. So the way to think about it is that there are a bunch of different components to this end-to-end experience of: I speak,

model thinks, model speaks back, right? And along that path you have the network latency. I'm speaking only to the primary components; there are a bunch of little things in the middle as well. But there's network latency: getting the voice from me to the machine, wherever it's located. Then there's the process of understanding when I'm done speaking. That's called turn detection: when have I finished sharing my thoughts, so the model can think and then speak back to me. That step of figuring out when the user is done speaking introduces some latency. Then there's the actual inference part, and that depends on whether you're using a full voice-to-voice model, a model that takes in voice directly and spits out voice, or whether you're doing a cascaded approach, where the first model converts the speech into text, the next model is the LLM, and then the next model takes the tokens coming out of the LLM and turns them back into speech. If it's the cascaded approach, then you have three different spots where additional latency can come in, right? There are three models here. They all have to run, and then they have to pass information between each other, so that's another place where latency creeps in. And then when the model finally spits out voice, whether that's from a TTS model or from the voice-to-voice LLM itself, you have the latency of the network as the AI's voice travels back and gets played out on my phone.

So it turns out you can actually shrink the latency in each of these different components. Where are you going to get the biggest bang for the buck? I would say there are three places. The first one is that three-models-or-one-model question: is voice going straight into the model, being processed, and spit back out, or are you actually doing this handoff between three different models? That's the first spot where you're going to get a big reduction in latency. The second place is the turn detection piece: figuring out when the user is done speaking, and having the delay be as short as possible between the user finishing and the data getting piped straight into the model, or maybe you're already streaming it into the model and it's deciding when the user is done speaking and has its response ready to go. And the third place is queuing. You have many, many people trying to hit these models at once; they're all using voice mode at the same time. How do you load-balance that workload, that demand, that speech data the users are sending to the model? How do you make sure there's a model already waiting and consuming that speech, versus making people queue up and wait, and then this person gets a response, and then this person goes? So doing the load balancing across those GPU workloads is another one. And then, as you mentioned, you can get a speedup on the model side from hosting on better hardware or

chips that are dedicated to a particular type of architecture.
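To make those latency components concrete, here is a minimal, hypothetical Python sketch of the cascaded approach (speech-to-text, then an LLM, then text-to-speech) with naive silence-based turn detection and per-stage timing. The `stt`, `llm`, and `tts` callables are placeholders for whatever vendors you plug in; nothing here is LiveKit's actual API, and the network hops on either end are not modeled.

```python
import time
from dataclasses import dataclass
from typing import Callable, List, Tuple

@dataclass
class StageTimings:
    """Per-stage latency, in seconds, for one conversational turn."""
    turn_detection: float = 0.0
    stt: float = 0.0
    llm: float = 0.0
    tts: float = 0.0

    def total(self) -> float:
        return self.turn_detection + self.stt + self.llm + self.tts

def detect_end_of_turn(frames: List[bytes],
                       is_silence: Callable[[bytes], bool],
                       silence_frames_needed: int = 25) -> int:
    """Naive turn detection: the turn ends after N consecutive silent frames.

    Real systems use trained end-of-utterance models; this trailing silence
    window is exactly the delay you want to shrink.
    """
    silent = 0
    for i, frame in enumerate(frames):
        silent = silent + 1 if is_silence(frame) else 0
        if silent >= silence_frames_needed:
            return i  # frame index where we decide the user is done speaking
    return len(frames) - 1

def cascaded_turn(frames: List[bytes],
                  is_silence: Callable[[bytes], bool],
                  stt: Callable[[bytes], str],   # placeholder speech-to-text
                  llm: Callable[[str], str],     # placeholder text LLM
                  tts: Callable[[str], bytes],   # placeholder text-to-speech
                  ) -> Tuple[bytes, StageTimings]:
    """One user turn through the cascaded STT -> LLM -> TTS pipeline."""
    t = StageTimings()

    start = time.monotonic()
    end_idx = detect_end_of_turn(frames, is_silence)
    t.turn_detection = time.monotonic() - start

    start = time.monotonic()
    transcript = stt(b"".join(frames[: end_idx + 1]))
    t.stt = time.monotonic() - start

    start = time.monotonic()
    reply_text = llm(transcript)
    t.llm = time.monotonic() - start

    start = time.monotonic()
    reply_audio = tts(reply_text)
    t.tts = time.monotonic() - start

    return reply_audio, t
```

A voice-to-voice model collapses the three inference stages and their hand-offs into a single call, which is the first big reduction described above; streaming audio into the STT while the user is still speaking, and keeping a warm model ready instead of queuing requests, attack the second and third.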

Yeah. Yeah. That makes a ton of sense.

Yeah.

That's exciting. You're raising a lot of money right now. Is this focused on opex or capex? Are you going to build your own data centers? Have you confronted the build-versus-buy debate internally?

Yeah, for data centers, I mean, we have really healthy margins now, so there's no pressure to go and build our own metal yet, but we'll get there over time. That's something that we've always talked about: eventually fully vertically integrating the stack. But I think the capital for us is going to be used in two primary ways. The first one is that we started our life as network infrastructure.

And that used to be the product that we sold, even to voice AI companies, but it turns out that

when we were selling voice network infrastructure, someone would say, well, how do I test this model? You know, I built it with your software and now I'm running it on your network, but how do I test it? And we're like, well, go over here, use this vendor, figure it out, glue this stuff together. And then they were like, well, how do I deploy it? And we're like, well, you know, use this vendor. And then they're like, well, how do I observe it? All the data generated, do I use Datadog? What do I do here? And we're like, well, you can kind of piece it together this way. And we kept hearing this over and over and over. And then we said, okay, we're just going to start building out all the pieces.

Because right now, building a web application is very familiar, very easy, and you usually have one single platform that allows you to build out the entire thing, right? Next.js and Vercel have become like the default platform for building out a web application. But a voice AI application, a robotics application, these things are actually very different from a web application. It's a completely different architecture. At the top of the iceberg it looks like, oh, well, instead of typing and clicking a mouse, now I'm talking and the AI can see me. But just that change at the input layer changes everything underneath about the architecture of how that application is built and all the infrastructure you need to get that thing built. And so what we're doing is we're really building every single piece of that, across the entire development life cycle, so that you can start with LiveKit from zero, just a dream, and scale to the moon in production,

in reality, with LiveKit, and you don't really need to go anywhere else. You can do everything within the platform. And so the first part is just building out

this product, right? The surface area of it is much wider than where we started. So that's the first thing for the capital. And then the second thing is really investing in DevRel and some go-to-market around education, helping developers understand how they can leverage this platform to accelerate what they're trying to build. You know, lots of sample apps and workshops and things like that, events, to make sure that people know there's this tool that's kind of magical and can accelerate their roadmap and their progress towards the vision of what they're trying to do.

Yeah, makes sense. Well, congratulations again. Thank you so much for taking the time to come chat with us and hope you have a great rest.

Yeah, have a great weekend. We'll talk to you soon.

You too. You too. Appreciate it.

Goodbye.

Let me tell you about Figma. Figma Make isn't your average vibe coding tool. It lives in Figma, so outputs look good, feel real, and stay connected to how teams build, create code-backed