Interview

Captions founder Gaurav Misra on building Canva for video and why talking-head video is AI's neglected frontier

Apr 17, 2025 with Gaurav Misra

Key Points

  • Captions positions itself as Canva for video, targeting creators who would never use professional tools like Premiere Pro rather than replacing them for power users.
  • AI video generation has focused on text-to-silent-video and B-roll while neglecting talking-head content, where most dialogue and communication actually happens.
  • Captions' viral growth has come from shipping genuinely impressive products rather than from engineering viral moments, a pattern Misra also sees in ChatGPT's unexpectedly large launch.

Summary

Gaurav Misra, co-founder of Captions, describes the company as Canva for video: a deliberate alternative to professional tools like DaVinci Resolve and Adobe Premiere, not a replacement for them. The pitch is that the two hardest problems in video are recording and editing, and Captions wants AI to do both on the user's behalf. The target is the person who would never open Premiere Pro, not the person who already knows how to use it.

The company found product-market fit roughly two and a half years ago, starting from four people and growing quickly from there. Its first product was automated caption overlays, and it has since expanded toward foundation model work on video generation and editing, described as expensive, multi-year projects that are now underway.

The neglected frontier: talking-head video

Misra's sharpest observation is about where the industry has misspent its energy. The AI video generation field has poured capital into text-to-silent-video — B-roll, stock footage, aesthetic clips — while almost no one has focused on videos where people actually talk. His argument is that dialogue and communication are the substance of most video content, not the establishing shot of the Empire State Building. Captions is explicitly focused on that communication vertical.

He draws a clean distinction between two categories of foundation model work. Media generation (video, audio, music) is a bounded problem: photorealism has a ceiling, and once you hit it, you've solved what you set out to solve. LLM-style intelligence is unbounded, and could plausibly replace human cognitive labor. Captions sits in the bounded category, where the goal is to expand who can create, not eliminate the creator.

Virality

On growth, Misra is candid that the best viral moments have been unplanned. His read on ChatGPT's launch is that OpenAI wasn't prepared for the scale of the response — and the same pattern has repeated at Captions. When the team has engineered for a viral moment, it hasn't landed; when they've shipped something genuinely impressive and let users play, it has. The strategy he describes is building in private until the product clears a "wow" threshold, then releasing.

Video as communication infrastructure

On where video goes by 2030, Misra argues that communication has always migrated toward higher-fidelity formats — from letters to texts to calls to video — and AI-generated video is part of that same arc. He flags that text strips out micro-expressions, pacing, emphasis, and body language, all of which carry meaning. Tools that only make you a better writer don't make you a better communicator, and that gap is largely unaddressed.

He also notes, from his time at Snap, that TikTok was spending roughly $100 million a month on ads on Snapchat during its early US growth phase, effectively using Snap's distribution to build a competitor. Internal A/B tests at Snap suggested TikTok ads weren't cannibalizing Snap engagement, but the real-world outcome was that users did shift time away. It's a footnote on platform risk that Misra raises without pushing further, but it sits underneath the broader point: video attention is finite even as video supply is about to expand dramatically.