Apr 10, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
of the entire technology industry. So welcome to the studio, Russ. How are you doing today? I'm good. I'm good. All right. So you're saying that I'm following fusion reactors. We're following fusion reactors, and immediately before that we were talking about fine watches made in Switzerland.

So we've got a little bit of everything today. Taking us on a whirlwind tour. I don't know if I can follow fusion reactors, you guys. I'll try my best. Well, we're excited to talk about voice AI. Yeah. But why don't you introduce yourself and the company, and then we can talk about the news and go from there.
Sure. Yeah. So I'm Russ. I'm the CEO and co-founder of a company called LiveKit. LiveKit actually started as an open source project.

Back during the pandemic, you know, if you think about all the technology that Zoom has underneath, we started an open source project that built something similar to all the technology that underpins Zoom, but as an API for any developer to integrate that kind of technology into their app.

So you could connect any two people, or any number of people on the planet, kind of instantly, with less than 100 milliseconds of latency, over video and voice.

Fast forward about a year and a half, two years, and we built a demo when ChatGPT first came out where you could talk to it instead of texting with it, using LiveKit's infrastructure, and a few months later OpenAI ended up finding it.

They wanted to build a voice interface to ChatGPT, and we started to work pretty closely with them on voice mode, and then advanced voice mode after that. And kind of almost by accident, I guess, we got pulled into this AI space. AI was not a space back then. Accident, yeah. I remember our Series A, which we closed, the first tranche, at the end of 2023, just at the start of 2024, and everybody was just telling me, hey, this is cool, but voice isn't going to be a thing for three to five years. And then, of course, GPT-4o happened. Yep. Can you steelman at all why speech to text is still bad on my iPhone?
I just really want it to be good. John's a big voice guy. He's always using LiveKit unknowingly. I worked at a voice AI company in 2010 or something. Twilio? No. You know Dragon NaturallySpeaking? I do.

Dragon eventually acquired this company, and it was a direct competitor to Siri. Apple bought Siri, and Vlingo was purchased by Dragon. It was such a fun startup. I think I remember Vlingo. It was a good company. I think it was a good exit for everyone involved, too.

And their primary target was BlackBerry. Their whole goal was, let's get pre-installed on BlackBerry. But there was a lot of chaos in the market at the time. It was a $20 app, then they had to cut their pricing, do an ad model, all these different things going on.

But I'd love to know, if you were in charge of, you know, Apple, how would you improve speech to text? Speech to text specifically, like transcribing text? Oh, well, anything in voice agents, or just improving the customer experience broadly, like surprise and delight, right?
Well, you know, it's one of those things that I think is funny. People talk about hallucinations as this bad thing, but in a lot of ways, for this kind of speech-to-speech interaction model where you're talking to an AI, hallucination is a feature.

So if you remember your Alexa or Siri, like when you were talking about, you know, 2010, 2012: you go and you ask it a question, and it's just going through a decision tree, and if it hits a dead end and it can't answer the question, it just doesn't do anything, or it says, "I can't answer the question."

And so the second that it can't do something you expect it to do, all of a sudden, it's such a punishing user experience. You just don't want to use the thing anymore, and you stop using it.

But with these new models that actually, you know, hallucinate answers into existence, even if it doesn't know the answer, it tries. You always get a response that comes back from the model. And for voice interfaces, that actually can be more of a feature than a bug.
It also helps that these models, it's funny, because it's human, right? So if you're having a conversation and John asked me a question and I don't know the answer, I go, well, is it like this thing? And you're like, no, it's this thing.

But for some reason, with these older generations of models, it was very annoying to go down that wrong path. Exactly. And it's always the same answer, right? It's like, I don't know, or I can't help you with that. I can't help you with that, or I'll search Google for you, or whatever.

And so I think that just having these LLMs that have ingested the entire internet and can always generate some plausible answer for everything, whether it's 100% correct or not, just makes the user experience so much better with these modern, contemporary voice agents that you interact with.

So, you know, the other thing I'll say, related to the Apple stuff, that's improving very quickly is latency. We have a lot of providers out there now, Groq, Cerebras, folks like that, who can run inference much faster than even a year ago for some of the model providers. And now LLM inference can actually be done in less time than generating speech with TTS.

And so I think getting that latency, end to end, down to, you know, 300 milliseconds or 500 milliseconds on average for turn latency, that kind of helps you cross this uncanny valley for voice AI. I mean, should we be thinking even faster than 300 milliseconds?
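To make those numbers concrete, here's a back-of-the-envelope "turn latency" budget for a cascaded voice pipeline: the time from when the user stops speaking to when they hear a reply. The per-stage figures below are illustrative assumptions, not LiveKit measurements:

```python
# Serial stages of a cascaded voice pipeline add up to turn latency.
# All component numbers below are assumptions for illustration.

def turn_latency_ms(stt_ms: float, llm_ttft_ms: float,
                    tts_ttfb_ms: float, network_ms: float) -> float:
    """Sum the serial stages from end-of-speech to first audio out."""
    return stt_ms + llm_ttft_ms + tts_ttfb_ms + network_ms

# A plausible-looking budget that lands in the ~300-500 ms range
# mentioned above (every value here is an assumption):
budget = turn_latency_ms(
    stt_ms=100,       # finalize the transcript after end-of-speech
    llm_ttft_ms=120,  # LLM time-to-first-token on fast inference hardware
    tts_ttfb_ms=80,   # TTS time-to-first-byte of synthesized audio
    network_ms=100,   # round trips between client, server, and providers
)
print(budget)  # 400
```

The point of the sketch is that no single stage dominates: shaving latency means optimizing every hop, which is why faster inference providers move the needle on the whole experience.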
And are there any efforts that you've seen to bake some of these models down into silicon? We've seen what Etched is doing with putting the transformer architecture on silicon. You could imagine that once Midjourney gets good enough, or kind of hits some peak, that they would just bake it down into silicon.

We would just have image generation models as ASICs, essentially, like what happened with Bitcoin. Is that the future here? Not that you would pivot into hardware, but maybe you would vend your software into a hardware provider at that level.

Well, we partner with folks like Cerebras and Groq, and so we allow you to kind of plug in their models and, effectively, their hardware-accelerated inference. And so we're compatible with that world.

On the inference side, I think it's going to continue to push as low as it can get. There are obviously some limits.

You know, it's a trade-off between capabilities and level of knowledge, and how fast you can run that pass through the model to get the result out. So there are trade-offs, of course, that follow the laws of physics, but there are also kind of diminishing returns after a while.
To give you an example, I once built this Cerebras demo. I used, like, a Llama 7B or 8B. Llama 8B. Hard to remember these numbers, the parameter counts.

But Llama 8B hooked up to Cerebras, and I got a bunch of feedback on that voice demo that the model was responding too fast, and can you slow it down, and it's kind of going off the rails a little bit. Yeah. Yeah. Interesting.
I mean, on that note, is there a kind of inference-time fine-tuning that needs to happen on the voice agent side to give the human listener the appropriate interaction?

Like, some people want to have, you know, this really fast back-and-forth conversation, and I'm like, just get to the point, I'm just trying to book a flight. Can you just give me the information as quickly and condensed as possible?

Other people might prefer an agent that speaks slower and really draws things out and gives the full context, and then lets them answer and kind of ping-pong back and forth that way. Is there any movement towards that level of fine-tuning that you think might happen in the future? Is it already happening?
I don't know. Yeah, it's already happening. So I think there are kind of two different flavors of this, and you can kind of combine them together over time.

The first one is that if you've interacted with the real-time models, the ones that natively understand audio, so there's the OpenAI Realtime API, there's the Gemini Multimodal Live API, and there are a few others coming out as well: for all of these models, you can actually tell them to, you know, whisper, or slow down, or speed up, or act hyper.

You can kind of give them an explicit signal of the style or the way that you want them to communicate with you. So that's already available and possible with these models.

Then there's another part of it, which I think will come in the next, you know, year or two, where the model will implicitly be intelligent enough to pick up on what your pacing is, or your state of mind, just based on the way that you're talking and expressing yourself.

And also, if we weave computer vision into it, it might see you and understand from visuals: okay, this person is stressed, or this person seems like they're in a hurry, or this person is calm and relaxed. It can kind of tell that by your body language and by the way you're speaking, and it can automatically adjust to you in the same way a human would be able to.
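The implicit adaptation described above can be sketched as a toy heuristic: estimate the caller's pace from their last utterance and mirror it in the agent's speaking rate. The thresholds and the rate mapping below are invented for illustration, not taken from any shipping model:

```python
# Toy "implicit pacing" heuristic: measure the caller's words per
# minute and pick a TTS speed multiplier that loosely mirrors them.
# All thresholds and multipliers are assumptions for illustration.

def words_per_minute(transcript: str, duration_s: float) -> float:
    """Estimate speaking pace from an utterance and its duration."""
    return len(transcript.split()) / (duration_s / 60.0)

def pick_tts_rate(user_wpm: float) -> float:
    """Return a TTS speed multiplier that loosely matches the user."""
    if user_wpm > 180:   # hurried caller: be brisk and concise
        return 1.2
    if user_wpm < 110:   # slow, relaxed caller: ease off
        return 0.9
    return 1.0           # conversational default

wpm = words_per_minute("I need to change my flight to tomorrow morning", 3.5)
print(round(wpm), pick_tts_rate(wpm))  # 154 1.0
```

A production system would presumably fold in prosody, stress detection, and (as mentioned) visual cues, but the control loop is the same shape: observe the user, adjust the delivery.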
So, you guys work with a lot of the biggest companies in voice already: OpenAI, Speak, Character AI. You're working with Tinder. I see a really broad swath of companies: Spotify, Oracle, things like that. Are you seeing as many startups as you would like sign up and start building in voice AI?

It feels like a category where, if you go back five years, people were really excited about voice, but then you had these sort of high-profile companies that didn't quite work. But, you know, maybe in many ways they were just too early.

So what are you seeing on the startup side, in terms of new companies being formed specifically to leverage voice, and LiveKit, and the underlying models?

Yeah, we're seeing, you know, several thousands of signups to the cloud product, our commercial product. And most of those, the vast majority, are startups and growing companies.

And out of those, probably around 75% or 80% of those signups are voice AI companies that are building voice agents.
And so the way that I kind of see the market segmented, to a degree, is that you have the large AI labs, and they have popular, you know, consumer apps, and a lot of them are building kind of open-ended voice agents that you can talk to about anything, right? They're assistants.

They do question answering, they do therapy for mental health, all kinds of language learning in the case of Speak. And then on the other side, you have these pockets that are what I call voice-native systems.

And those are really anything you pick up the telephone to. You know, when you call a business, there's someone that answers that line, and they're either doing patient intake at a hospital, or they're doing loan qualification or insurance eligibility checking.

There are a lot of these kind of business process flows, and there are pockets that are really large in nature. So customer support is the one that gets talked about a lot, right? Like some of the markets that Sierra, Decagon, and folks like that are playing in.

And so for all of those systems, we're seeing tons of startups flood into those spaces, because it is now viable to take an LLM, have a voice stack like LiveKit, for example, hooked up to that LLM, and build an automated version of whoever is picking up the phone on the other end.
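The pattern described here, an LLM with a voice stack wired around it, boils down to a three-stage loop: speech in, STT, LLM, TTS, speech out. This is a minimal sketch of that control flow with stand-in functions; the stage implementations are placeholders, not real LiveKit or provider APIs:

```python
# Minimal cascaded voice-agent turn: STT -> LLM -> TTS.
# The three stage functions are stand-ins for real providers;
# the control flow is the part being illustrated.

def transcribe(audio: bytes) -> str:
    """Stand-in for a speech-to-text provider call."""
    return audio.decode("utf-8")  # pretend the audio is already text

def generate_reply(prompt: str) -> str:
    """Stand-in for an LLM completion call."""
    return f"You said: {prompt}"

def synthesize(text: str) -> bytes:
    """Stand-in for a text-to-speech provider call."""
    return text.encode("utf-8")

def handle_turn(audio_in: bytes) -> bytes:
    """One conversational turn: the phone-line automation pattern."""
    transcript = transcribe(audio_in)
    reply = generate_reply(transcript)
    return synthesize(reply)

out = handle_turn(b"I want to check my insurance eligibility")
print(out.decode())  # You said: I want to check my insurance eligibility
```

Swapping the stand-ins for real STT, LLM, and TTS providers (plus real-time audio transport) is exactly the glue work a voice stack abstracts away, which is why so many vertical startups can now build on top of one.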
Well, yeah.
The thing that I'm excited about: everybody's gotten on one of these, like, robo CX calls, and I've had the experience in the past where I just say "talk to a human" until I get to a human, because I know they're not going to resolve the issue the way that I want, and I'm, like, the annoying guy to the robot.

But I do feel like there's a point coming quickly where the AI can be considerably better than the human on the other end, because they're not tired, they're not in a bad mood that day. Like, you know, they just have high energy, because they're code, right?

There's going to be a flipping, and it's going to be like, "Oh, I'm talking to a human. Robot, robot, please put me on with the robot." Yes, throw me on the robot. Yeah.
I was uh I was on a customer support call with Comcast uh the other day and um I was talking to a human for sure and they were trying to look up something for me and then they started asking me if I'd ever seen the snow and if I'd ever spent time in the snow and I was like, "What? " and like, "Where do you live?
" And I'm like, "California. " They're like, "So, do you experience snow in California? " I'm like, "Well, in Tahoe, yeah, I guess. " But, uh, it was so weird. And I was just like, "Am I talking to an AI right now, or is this a human? " And if it's a human, I kind of want the AI. And so, it's very bizarre.
Super awkward. Can you tell me about some of the more, like, nuts-and-bolts enterprise customers you're working with? I see this case study from Playback. We actually had the CEO of Playback on the show. Live streaming for sports. Makes a ton of sense.

Do you know exactly how they're leveraging LiveKit? Can you give us a more concrete example of, like, their infrastructure, basically? Yeah, for sure.

So Playback, they're also going to integrate, you know, AI into that flow as well, like a voice-based commentator or whatever that can watch the game and provide an overlay for fans. But they use us for a pretty different use case.

In that case, they have these kind of courtside cameras at, like, NBA games, MLB games, and all of that. And these cameras are, you know, the way that you watch sports on TV.
And so one interesting part about this, and this is kind of a deeper technology thing: if you've ever watched, like, the Super Bowl or the NBA Finals or any sports game on a TV, and you're texting your friends about it while they're watching the game, or there's someone in another part of your house that has the game on a different TV, you'll notice that you're not synchronized.

Like, you're seeing plays before they are, sometimes up to, like, 30 seconds or a minute before. And so the technology that we use for live broadcasts of sports games is not true real-time technology.

Not everyone is synchronized in the same way that, in 1969, with the lunar landing, everybody saw it at the same time, because there was a server sending the broadcast out over TV antennas that everyone was receiving, and it was truly a shared experience. And so what Playback does is ingest the video feed from those cameras at courtside, or at the MLB game, using LiveKit. That goes into a single system, and then through LiveKit's cloud network, for everybody who's watching that NBA game, for example, we are effectively shuttling those bytes from that camera, in a synchronized way, to every single person that's watching that game.

So when, like, Steph sinks a three, everybody who's watching that game through Playback is seeing that three get sunk at the same time. They're able to truly have a shared experience of watching their favorite team play the game together, cheer together, all of that stuff.
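The synchronized fan-out idea can be modeled in a few lines: one ingest point, many viewer queues, and every frame delivered to all of them together. This is a toy illustration of the concept, not LiveKit's actual implementation (which has to handle jitter, ordering, and scale across a real network):

```python
# Toy model of synchronized broadcast fan-out: one camera ingest,
# every frame forwarded to all connected viewers at once, so
# everyone sees the same play at (nearly) the same moment.

class Broadcast:
    def __init__(self) -> None:
        self.viewers: list[list[bytes]] = []

    def join(self) -> list[bytes]:
        """A viewer subscribes; returns their queue of received frames."""
        queue: list[bytes] = []
        self.viewers.append(queue)
        return queue

    def ingest(self, frame: bytes) -> None:
        """Fan one camera frame out to every viewer queue."""
        for queue in self.viewers:
            queue.append(frame)

game = Broadcast()
alice, bob = game.join(), game.join()
game.ingest(b"steph-sinks-a-three")
print(alice == bob)  # True: both viewers received the same frame together
```

Traditional broadcast chains (CDN segments, player buffers) introduce per-viewer skew at each hop; a single real-time fan-out path is what removes it.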
And so, Playback is effectively using us for the backbone of all of the audio and video transmission that is going on within their application. Makes sense. Before we let you go, we should probably talk about the news today. You guys had a bit of an announcement.

Why don't you share a little bit there? Yeah, for sure. So, kind of how I mentioned at the start, right? We started as an open source project, you know, connecting humans to other humans.

But now we have found ourselves operating at a very large scale, connecting humans to machines, using voice and computer vision. And so we closed a Series B today, led by Altimeter Capital. Let's do it. Let's go. Let's go. Yeah.

So we closed that round, and now we're really going after voice AI. We're building, like, an all-in-one platform for anybody, any developer, to build a voice agent. Give us the stats. One million? Two million? How much did you raise? $45 million. $45 million. I've got to do the 45 million.
I'm not going to submit to that. But congrats, an amazing milestone. I think, you know, clearly a five-year overnight success. Yes, classic example. But it's cool.

I love these stories where you work on a really hard problem and then discover, you know, the most powerful application years later. So congratulations to you and the team. Great to have you on as our new official voice AI infrastructure correspondent. You love that line.

I'll make a clone of me, and then that will be your official voice AI correspondent. Cool, guys. That's perfect. That's perfect. Thanks so much for stopping by. Thanks for coming. Really appreciate it. Have a great one. Talk soon. Talk soon.

And next up, we've got Ev Randall coming in from Kleiner Perkins. I love how you can always go niche enough to make somebody a correspondent. Yeah, exactly. Hyper, hyper niche. Anyway, $45 million. That's a great round.

There are so many funding rounds happening now that we're in the thick of it. I'm realizing, wow, the money is flowing in Silicon Valley. It's great. But we will get a full market update from Ev and hear what he's seeing over at Kleiner.

Actually crossed paths with him at Founders Fund two years ago, maybe two and a half years ago. Great dude. Wrote probably one of the most banger memos of the ZIRP era, all about Tiger Global and crossover investing, and I want to follow up with him on that.