Interview

Cua: Computer use AI agents wrapped in a full OS sandbox — infrastructure play betting 20% of AI agent scenarios will rely on browser/computer tools

Jun 11, 2025 with Franchesco & Aleandro

Key Points

  • Cua builds AI agents that operate via screenshot interpretation inside an OS sandbox, betting that 20% of multi-agent workflows in five years will require browser and desktop tools rather than APIs.
  • The YC W25 company's open-source framework has more than 8,000 GitHub stars and converts users into cloud customers; its next revenue layer is to become an inference provider for computer-vision UI models.
  • Cua uses hybrid execution combining deterministic RPA for predictable paths with AI inference only when workflows deviate, addressing the gap between prototype reliability and production-ready agents.

Summary

Cua, a YC W25 company founded by Franchesco and Aleandro (CEO), builds computer-use AI agents wrapped inside a full OS sandbox, essentially a Docker container for AI agents. Instead of parsing HTML or reading accessibility trees, Cua uses screenshot-based, pixel-level models to interpret and interact with the screen. Franchesco, who spent over five years on the Windows team at Microsoft working on agent benchmarks, says research backs this approach as more reliable for operating system environments where there is no DOM to parse.

Cua's founders believe that in five years, roughly 80% of multi-agent workflows will run via API, but the remaining 20% will require browser and computer tools. That 20% is their market. Franchesco estimates computer-use agents are still about six months from a "ChatGPT moment," while browser-use agents are already there.

Open source traction

Cua has an open-source framework with over 8,000 GitHub stars, and its go-to-market strategy flows from that traction. Inbound users ask how to run workflows that already work locally in production at scale, and Cua charges for the compute behind cloud deployment. The next revenue layer is becoming an LLM inference provider for computer-vision UI models. Franchesco argues that OpenAI and Anthropic dominate perception mainly through PR. He says Baidu's model already outperforms both on computer-use benchmarks, but such alternatives are hard to discover and set up. Cua wants to be the catalog and platform for that model ecosystem.

Vertical strategy

Cua has deliberately avoided picking a vertical. Early customers arrived with highly idiosyncratic requests—contractors editing video on macOS, for instance—and no common workflow pattern emerged. The founders are letting customers chase verticals for them while Cua stays horizontal.

Episodic memory and reliability

A workflow that works 80% of the time isn't production-ready. Cua is building episodic memory to address this. Deterministic, RPA-style execution handles predictable paths, and full computer-use inference kicks in only when a webpage changes or the agent deviates from the expected trajectory. The hybrid architecture treats legacy UI automation and AI inference as complementary rather than competing.
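The hybrid idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not Cua's implementation: the names RecordedStep, run_hybrid, observe, replay, and infer are all hypothetical, and a screen is reduced to a plain string fingerprint rather than a real screenshot.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple


@dataclass
class RecordedStep:
    """One step of a previously recorded, known-good run (hypothetical structure)."""
    expected_screen: str  # fingerprint of the screen the step was recorded on
    action: str           # deterministic action to replay, e.g. "click:sign_in"


def run_hybrid(
    steps: List[RecordedStep],
    observe: Callable[[], str],
    replay: Callable[[str], None],
    infer: Callable[[str, str], str],
) -> List[Tuple[str, str]]:
    """Replay recorded steps; call the model only when the screen deviates.

    observe() returns the current screen fingerprint, replay(action) performs
    an action, and infer(seen, intended) stands in for a computer-use model
    that proposes a replacement action. All three are assumptions.
    """
    log: List[Tuple[str, str]] = []
    for step in steps:
        seen = observe()
        if seen == step.expected_screen:
            # Predictable path: cheap, deterministic, RPA-style replay.
            replay(step.action)
            log.append(("replay", step.action))
        else:
            # Deviation: fall back to inference for this step only.
            action = infer(seen, step.action)
            replay(action)
            log.append(("infer", action))
    return log
```

In this sketch the returned log records which steps were replayed and which needed inference, so a run whose pages have not changed never calls the model at all.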

Benchmarks

Franchesco is skeptical of current computer-use benchmarks. Windows Agent Arena, derived from OSWorld and developed partly during his time at Microsoft, assigns tasks like opening VLC or using LibreOffice, software few users rely on today. He expects the next generation of benchmarks to measure real-world tasks, potentially including qualitative human evaluation similar to LM Arena, where human raters watch two agents complete tasks side by side and judge on fluency and coherence, not just task completion.

Both founders moved to San Francisco three months ago specifically for YC. Aleandro says the dream of doing YC dates back to when he was 16.