Captions founder Gaurav Misra on building Canva for video and why talking-head video is AI's neglected frontier

Apr 17, 2025 · Full transcript · This transcript is auto-generated and may contain errors.

to welcome him to the show. How you doing today? Boom. Hello. Hello. There we go. What's going on? How's it going? Uh it's great. Thank you so much for being here. Thank you so much for creating the Captions app. I use it very very regularly. Du bas. Oh my god. I'm obsessed. Yeah. Yeah. For a long time.

I mean, we make a ton of clips. You've probably seen, uh, and it's always a hassle to put captions over, but you made it much easier. Uh a lot of that. I want to talk about uh the history of the technology, the Whisper turning point, and then get into value creation, the application layer.

There's so much we can talk about, but why don't you kick it off with just like an introduction on like how you're describing the company these days and then the most recent announcement. Totally. I mean, so love that. By the way, you're an OG user, so thank you for being a user.

But you know it's a pretty young company. Like, our product market fit was like two and a half years ago, right? So like we were four people two and a half years ago and things have grown really really fast. It all started with adding captions, you know, your most basic thing, but it's evolved quite a bit.

And um so today, I mean, the way we think about the company today is like, you know, video creation is hard, right? And we've identified two problems actually, right? One is recording the video is hard. You guys know this, right? And editing the video is hard too, right? It's pretty technical.

And we want to solve these two problems. Like these two problems we want to help people jump over with AI, right? So if you want to edit video yourself, if you want to record video yourself, go somewhere else basically, right? We're going to actually do it for you. That's the value, right?

So not going after DaVinci Resolve, not going after Premiere, an entirely new market. Exactly. Right. And we think of it very similar to Canva actually, like you know Canva for video as an idea has been sitting around for a while and I think it's actually finally possible because of AI, right?

Because the real value of Canva is you start with something, right? You don't have a blank screen and you know it's not built for the designer, right? Most designers will look at Canva be like I could probably do better than this, right? But it's built for the person who's not a designer, right?

And that's the same value that we provide. So on that note, right, like last year we started working on essentially foundation model technology for video generation and for editing that's going to help us achieve that. And you know, these are big projects, very expensive, uh, but also very cutting edge.

Um, I think the most exciting part on the video gen side for me is like we're very much focused on like talking videos, right? Which is and by the way, like I'm also like kind of surprised almost in a way that we've spent so much time and so much money doing text to video on silent videos, right?

Like what what's the point, right? Like just stock videos. Like that's that's the negligible part of what a video is. Yeah. Almost no time spent on videos with actual communication. So, that's kind of what we've been focused on for the last year. Yeah.

Uh, what, like, I don't know, what breakthroughs are the most interesting to you? Obviously, like Whisper is super important. I feel like Whisper is great and then all of a sudden it came down to like, okay, I want real-time Whisper on the show and then I got to go build that.

Or uh NotebookLM. I like NotebookLM, like it still doesn't have an app even though Google is paying people not to work and stuff. It makes no sense to me. Uh, and I'm imagining that like Captions could be an app where I'm getting like NotebookLM-style, like, YouTube videos.

YouTube's talked about this a little bit. They haven't rolled anything out, but you're starting to see a lot of this stuff. A lot of it's in like the slop tier, but what I like about Captions is that you can still inject enough of the human element to take it from it's a tool that's used in partnership with a human.

So, it still has that art in there. Um, but but what what is exciting you and what's the most interesting in terms of like where you want this to go? Yeah. I mean, honestly, like there's a pretty clear distinction that's developing that I'm starting to see which is like amongst the foundation models, right?

Like there's the text generation, like LLM-type models, right? And those are solving a very difficult problem, which is intelligence, right? Like an unsolved problem. No one's solved intelligence before, right? So, and by the way, we don't even know what the bound is, right? Like where does it end? Who knows?

It could never end, right? It could go on forever, right? And then on the other side, you got, you know, media generation. Like this is everything from like video generation to music generation, sound, audio, like all this stuff, right? These are solved problems. Like we can do rendering today, right?

Like we can literally render anything you want with like CGI and stuff, right? Of course. It's just becoming a lot easier and it's also bounded, right? Which means that there is a limit of realism, right? And then you've kind of solved it essentially, right?

You can't get more real than real, and once you're there you kind of have achieved what you set out to achieve, right? And so I think it's a different type of problem, and also it means that it's not about replacing the human, right? Because what's actually happening is the craft is evolving, right? The craft is different but the creativity is still there. Whereas on the LLM side, that's actually potentially replacing the human, not going to lie, right? That's potentially what it's going to do, right? So these are two different types of use cases almost, two different types of value that are being produced today.

I think we fall definitely much more in that sort of media generation category where like our goal is not to replace anybody, right? It's actually like empower a bunch more people potentially, right? And who knows, right? People come in, they use Captions to make their videos, right?

And we edit it for them, we like generate the video for them and you know that just gets them started, but like two, three, four years down the line, you know, they move on to Premiere Pro or something like that, that's awesome. Nothing wrong, right?

So, how do you, um, on the developer side, we're seeing a bunch of startups where it seems like everything is just converging into this like one-shot, right? Where it's like you have Lovable, Bolt, uh, Rippling, uh, sorry, Replit. Rippling is one-shot for your HRIS system. That's right. That's right.

Um so everything's kind of converging onto this like text box where you just tell it what you want and then it makes it makes a website or an app or things like that.

Uh I imagine content will maybe go that way for some use cases and that's kind of like there's a lot of sort of momentum and convergence around that moment. Um how do you see and but that's just you know my point of view.

What's your point of view on like how all this evolves and um how you're looking to continue to differentiate Captions over time other than just sort of like chasing perfect realism? Totally. Yeah. I mean, so couple things there.

Like I think generally uh kind of to, you know, make a comparison what you're talking about, right? Like I think of it as like Canva for everything. That's actually what's happening, right? Because the magic of Canva, you know, awesome company.

I think the magic of it is not about how simple the UI is or something like that. The magic is that you start with something. It's not a blank page, right? And really the biggest enemy of anything creative is the blank page. Yeah. And not the UI and stuff, right? I mean, think about like design software in general.

You look at like Figma and stuff. Like Figma has like six buttons, right? Like it's not a hard UI, right? But it's really hard to make something good with it. Like it's really hard to figure out how to use squares and circles to make something that looks good, right?

And I think the magic of Canva is you start with something, right? You're already 90% there when you enter and then you kind of make some tweaks to get it to 100, right? And by the way, like think about ChatGPT. It's kind of the same thing, right?

We're using it all over the place today, but it just gets you started, right? Like it's like boom, I already have something like I need a job description. Boom, there's job description, right? For whatever job you want, right? And then it may not be perfect, right?

You make a few tweaks, you know, change things here and there and you're done, right? And so it's like Canva for everything. That's what's happening. And I think same for music generation or, you know, video generation, like all these things are going there.

Our goal and our mission in this like we're focused on specifically the communication sort of vertical, right? So think about this, right? If you think about a movie or, you know, a TV show or anything like that, right? Any kind of like media today, um, only a small part of that is B-roll, right?

Like if this was a movie, like I'm in New York, so like it might open with like a shot of the Empire State Building and then the next scene like, oh, there's a New York taxi cab on the street passing by really quick in 2 seconds and then the camera is in the room and we're talking, right?

And that's actually the movie, right? And so so much time and money has been spent on making the shot of the Empire State Building and almost nothing on like actually getting the dialogue going, right? That's kind of the the weird thing, right?

And like our thing is like let's get that communication, that dialogue problem solved, right? That's one. And on the other side, just footage isn't enough. So let's get it edited to make it actually an asset. Right. Can you talk a little bit about uh growth for Captions?

Um, there's a weird dynamic where uh it can be extremely valuable to go viral with like a one-shot thing. I'm thinking of Lensa, those magic avatars that it was just upload a couple photos and you get a photo of yourself. Then we had the Studio Ghibli moment which was a huge growth vector.

Uh, it's not OpenAI's product, but it was still probably massively beneficial just to drive a bunch of extra installations and ChatGPT use, right?

And so I could imagine you guys thinking like, hey, let's go make a one-click Harry Potter Balenciaga generator and like we are just like really good at making like Harry Potter Balenciaga style videos. Of course, you need to put in your own tweak, but that's what we're great at.

But you don't want to get pigeonholed into that, but it can be a good growth driver. Are you thinking about that consciously? Are you thinking about like how can I get, uh, the next Studio Ghibli moment to happen in the Captions app?

Yeah, I mean, so we are, but our philosophy on this, honestly, like, think about both the Studio Ghibli thing but also think about like original ChatGPT, right? Like I remember the time where, you know, GPT was available, like I would show it to my friends, like check this out, like check how cool this is, and people would be like, oh wow, cool, okay, right? And then suddenly ChatGPT came out and like, by the way, I think it was very clear that they hadn't prepared for the amount of virality that thing got, right? Like even the name ChatGPT kind of gives that away, right?

And so it wasn't a planned thing. It kind of just happened, right? So Ghibli, same thing. I don't think they planned it. Like it just kind of happened, right? So I think if you create the right environment where people are given the creativity to go try something that's like an awesome technology, right?

They they can play around with it, make cool things with it. Like these types of moments kind of happen naturally. It has happened for us several times with different technologies we released in the past. And a lot of times it's been unplanned. It's just like, you know, when we plan for it too much, it doesn't happen.

When we don't plan for it, it just like suddenly explodes like completely, right? So, that's kind of what we've seen. Um, and it's something we think about of like how do we create that wow experience because at the end of the day, a lot of the growth and virality is happening because people are just blown away.

It's just so impressive, right? It's beyond anything anyone's ever seen, right? And I think that is a pretty high bar. So, building on that, building in private until we reach that bar, releasing it as like a wow, this is like crazy. like that's the type of stuff that we've seen work really well. Uh, next question.

Do you have any sort of visions around uh what the just video content on the internet in 2030? Because I have this uh right now there's not a huge incentive to make a video for one person, right? Especially in a business context.

If I if you want to explain something for 10 minutes to somebody in a business context, you pick up the phone, you call them, you spend the time to send an email or whatever.

Now, with something like Captions, it's like, well, I could just generate a video of like how my product works and why it's relevant to this industry and all that. And so, I have this sense that content generation is gonna like 100x, 1,000x, but human attention is not going to 100x or 1,000x, right?

It's just not possible. There's only so much time in the day. People use their phone for six, eight hours on average right now. They're still spending two hours a day on Netflix, but we're not going to suddenly like get, you know, a 100 hours a day.

Uh, even though people on, you know, TikTok, uh, entrepreneur influencers might want that. Um, so I just have this um kind of question around like do you think the average video in the future gets one view, two views, and maybe it's not the average, right?

Because certain views will or certain videos will still go super viral and be sort of cultural phenomenons, but um but yeah, I'm curious if you think that's kind of like where we're headed. Uh right. I mean, I think for what it's worth, I think the average video today is getting probably one or two views, right?

Because like think about Snapchat, like a lot of video, probably a billion videos a day, but sent to like, you know, a few people are seeing it basically, right? For the most part private communication. And honestly like Snapchat pioneered that, like they kind of missed the TikTok part of it.

It wasn't part of their ethos to be honest, right? But like they were more about the private communication.

But uh I I think the future is more video like for what it's worth like you know the way and I think there's an interesting sort of like move that I think will happen towards more video in AI as well because think about like how communication's changed over time, right?

Like we're not sending letters to people as often anymore, right? Like text messaging is kind of like very prone to miscommunication, right? Audio is definitely better. A phone call goes a long way. Video call is like one step further, right? And then real life meeting is even beyond that, right?

Often times there's like even within companies like right there's miscommunication and like mistrust that builds when there's remote teams or things like that, right? And you got to watch for that. Uh whereas an in-person team just trusts each other so much more.

So there is definitely something to be said about like these more sort of multimodal forms of communication, right, to use the term. But, uh, I do think that actually even on the AI side, right, like sure, ChatGPT makes me a great writer, but like what makes me a great communicator, right? And we're really not thinking about that, right? Because communication is multimodal in itself, right? Like the words that I'm saying right now, where I'm pausing, what I'm emphasizing, how my micro expressions are moving, like how my body is moving, like all that is communicating in multiple forms, right?

A message and that message changes if I change any of those things. Like the words might be the same, right? But I can change the message completely by just changing the delivery of it, right? So I think today's technologies like just aren't capturing, right? Like how broad uh communication actually is.

And I think it will all evolve towards video over time just as we've seen, like this is not new. We've seen this happen before, right? So and by the way like I was at Snap when like TikTok took off, right? Uh, 2019, that era, and initially like TikTok grew a lot on the back of Snap.

Not a lot of people know this but like they were running like a hundred million a month of ads on Snapchat, right, initially, right, when they were not there was no presence into the hen house. Yeah.

And like there was concern, like in the company people were concerned that, like, are we creating a competitor here, right? And we were running tests. Yeah. Narrator: they were. Exactly. But like we ran A/B tests to test, right? Like if someone sees a TikTok ad, are they less likely to engage in Snapchat?

Tests didn't show that, right? But the reality is that it wasn't true, right? They did, uh, spend less time on Snapchat. So, well, we got to run. This was a fantastic conversation. Uh thank you so much for hopping on and we'll definitely talk to you soon. Yeah, great talking. Thanks for coming on. Bye.

We got Ian in the waiting room from Astromec. Uh, Astromeca. I believe I'm pronouncing that correctly. Astromeca. Uh, great. Mechanica. Astromechanica. Sorry, I mis I I mistyped that. Astromechanica. Anyway, we'll let him explain it to us. Come on in, Ian. Sorry for the wait. Sorry for the wait.
