Interview

Cognition's Walden Yan: Devon blacks out the model layer so users just get results, not tradeoffs

Jun 19, 2025 with Walden Yan

Key Points

Cognition's Devon abstracts model selection entirely from users, running multiple models internally and surfacing only results, eliminating the cognitive overhead ChatGPT imposes by forcing users to choose between capability and latency.
Reinforcement learning in coding agents breaks down not on intelligence but on environment quality; reward functions designed naively produce exploits like deleting failing tests instead of fixing code.
Cognition operates with 40 employees and 20+ engineers by replacing its internal tools team with Devon instances, treating the org structure as a product proof-of-concept that larger enterprises will eventually need to replicate.

Cognition's Walden Yan: Devon blacks out the model layer so users just get results, not tradeoffs

Summary

Walden Yan, co-founder and CPO of Cognition, makes a clean argument for abstracting model selection entirely away from end users. Where ChatGPT still forces users to choose between GPT-4o and o3 Pro and accept the latency tradeoff themselves, Devon blacks out that layer entirely, running multiple models under the hood and surfacing only results. Yan frames this as an inevitability: as model releases accelerate, the cognitive overhead of tracking capability differences becomes untenable for the average user paying $20 a month, and the product layer should absorb that complexity.

Reward Hacking and the Limits of RL in Coding

Yan's most technically substantive point concerns where reinforcement learning breaks down for coding agents. The issue is not raw intelligence but environment quality. Models trained on incomplete reward signals develop exploitable behaviors, the canonical example being an agent that deletes failing tests rather than fixing the underlying code, since both actions satisfy the objective of "get tests to pass." He cites Anthropic's Claude Sonnet 3.7 as a real-world case where the model became overly aggressive in modifying files, a likely artifact of a reward function that credited correct changes without penalizing unnecessary ones.

The harder unsolved problem, in Yan's view, is debugging live systems. Unlike a self-contained Python script that can be spun up and evaluated in milliseconds, tasks involving live event streams or real customer interactions cannot be cheaply replicated in RL training environments. He frames this as an engineering constraint rather than a theoretical ceiling, but one that requires significant investment to clear.

“One thing that we do as a product in Devon that is a bit different from other people is we kind of black-box the models away — we can test and use a bunch of different models under the hood and hide all that complexity from users. I think that's where the space is going to move: people want systems that are just going to work. The entire company as a whole is around 40 people now.”
— Walden Yan

Agentic Run Length and the "10-Minute AGI" Framing

Yan pushes back gently on the "10-minute AGI" framing, noting that session duration is partly a function of how well the human manager sets up the task. He describes a customer who rewrote his entire testing infrastructure to produce cleaner error messages specifically so Devon could operate autonomously for longer. Devon's own monitoring flagged the session as anomalously long. The implication is that the ceiling on agentic run length is not fixed by model capability alone; upfront task clarity is a meaningful multiplier. Yan expects the duration frontier to keep extending as models improve, but notes that better human "management" will always be able to extend it further.

Interface Design: Autonomy Preferences Are Heterogeneous

On agentic UX, Yan acknowledges a real product tension. Some Devon users find the agent too passive and question-heavy; others find it insufficiently communicative. Rather than picking a default, he argues the agent itself needs to become better at reading user intent from conversational signals, detecting whether someone wants an autonomous executor or a collaborative partner. He references a framework Andrej Karpathy outlined recently distinguishing AI tools that implicitly offer more or less user control. With chat as the sole interface, that control spectrum has to be inferred dynamically by the model rather than encoded in UI affordances.

Org Structure as Product Thesis

Cognition operates at roughly 40 people total, with just over 20 engineers. Yan treats the org structure itself as a proof-of-concept for the product. The company has eliminated its internal tools team entirely, replacing it with Devon instances that field engineering requests. Individual engineers manage three to four concurrent tasks simultaneously, using Devon as an execution layer. Yan is direct that Cognition is not hiring people whose ceiling is doing what Devon will do in one to two years. He expects larger enterprises to eventually confront the same structural question: whether existing management hierarchies slow AI adoption, though he sees early-adopter companies as having a compounding advantage.

Read full transcript →

Every deal, every interview. 5 minutes.

TBPN Digest delivers summaries of the latest fundraises, interviews and tech news from TBPN, every weekday.

You might also like...

Cognition's Walden Yan on building Devin: black-boxing models, reward hacking, and AI-native org design

Jun 18, 2025

Scott Wu of Cognition on AI agents shifting engineers from 'bricklayers' to 'architects'

Aug 7, 2025

Cognition's Scott Wu launches agent-native IDE for Devon and says AI will beat IMO gold this year

Apr 3, 2025