Will Brown on AI scaling limits, RL as the real 2025 trend, and why enterprise AI adoption is slower than advertised
Apr 21, 2025 with Will Brown
Key Points
- Reinforcement learning, not model scale, is driving AI capability gains in 2025, with smaller models trained via RL outperforming much larger base models for enterprise tasks.
- Enterprise AI adoption remains slower than advertised because most pilots are deliberately isolated from production workflows, and converting them to regulated business processes carries switching costs comparable to changing cloud providers.
- DeepSeek's efficiency on 2,000 GPUs while Anthropic shows capacity warnings reveals significant inference optimization headroom left by US labs, prompting OpenAI to recruit optimization specialists from high-frequency trading.
Summary
Will Brown — who works at Morgan Stanley and follows AI development closely — lays out a view that reinforcement learning, not raw scale, is the real technical story of 2025. The framing matters for anyone sizing up where frontier AI investment actually creates durable value.
RL over scale
Brown is skeptical that bigger transformers are the next unlock. His argument: the training data wall is real, diminishing returns on capital are already showing up, and the performance gap between genuinely massive models and mid-sized ones doesn't justify the compute cost. The economic sweet spot, in his view, sits somewhere between 30 billion and a few hundred billion parameters. When Meta releases a behemoth model, most enterprises won't run it for day-to-day work.
What actually explains o3's quality, Brown argues, is that it was trained via RL to use tools correctly — not that it's simply larger. GPT-4 was a bigger model than many in use today and had comparable training data, but it wasn't trained this way. The agent wave people have been anticipating for two years only started working when RL got applied properly. His bet is that smaller models trained with agentic RL will outperform much larger base models for most real enterprise tasks.
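The agentic RL Brown describes depends on rewards a trainer can verify automatically. A minimal sketch of what such a reward function might look like, where the `<tool>`/`<answer>` tag format and the `tool_use_reward` name are illustrative assumptions rather than any lab's actual setup:

```python
# Hypothetical reward for tool-use RL: a rollout earns credit for
# emitting a well-formed tool call and for a verifiably correct final
# answer -- not merely for fluent text. Sketch only; tag names and
# scoring weights are invented for illustration.
import json
import re

def tool_use_reward(completion: str, expected_answer: str) -> float:
    reward = 0.0
    # 1. Partial credit: did the model emit a tool call whose
    #    arguments parse as valid JSON?
    call = re.search(r"<tool>(.*?)</tool>", completion, re.DOTALL)
    if call:
        try:
            json.loads(call.group(1))
            reward += 0.5
        except json.JSONDecodeError:
            pass
    # 2. Full credit: did the final answer match? Exact string match
    #    stands in for a task-specific verifier here.
    ans = re.search(r"<answer>(.*?)</answer>", completion, re.DOTALL)
    if ans and ans.group(1).strip() == expected_answer:
        reward += 1.0
    return reward

rollout = '<tool>{"query": "revenue 2024"}</tool> ... <answer>42</answer>'
score = tool_use_reward(rollout, "42")  # valid call + correct answer
```

The point of the structure is that both reward components are checkable by a program, which is what makes tasks like this trainable with RL at scale.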
He frames current AI capability in terms of task duration rather than abstract AGI benchmarks. In his estimate, o3 can reliably handle most things a competent human completes in about ten minutes. A full design system built coherently across an entire product, the kind of work that takes days, is still out of reach.
Enterprise adoption
On finance specifically, Brown's read is more cautious than the headline narrative. Morgan Stanley had integrations ready the day GPT-4 launched, which shows large institutions can move fast on high-priority initiatives. But the long tail of smaller, less-defined tasks, the ones that make up most of a knowledge worker's day, doesn't have clean drop-in replacements. Vendors take months to onboard, and engineering resources to build internal tools are scarce.
Deep Research works well for what it's designed for, but ask it to produce a table of 50 precise data points and it starts making mistakes. The report format has enough slack that errors go unnoticed; structured outputs don't. And he has not, he says flatly, heard anyone claim Microsoft's PowerPoint Copilot saves them time.
The J&J pattern — hundreds of pilots, then cutting roughly 90% — is, in Brown's telling, how enterprise AI evaluation is supposed to work. Most pilots are deliberately walled off, run by internal beta testers, and never intended to roll out company-wide. Converting a pilot to a workflow embedded across a regulated institution is closer in friction to switching cloud providers than to installing software. That friction is structural, not temporary.
Windsurf's success as a Cursor competitor, he argues, comes directly from prioritizing enterprise integration, a priority Cursor has not matched. The OpenAI acquisition of Windsurf reads to him as a signal of friction with Microsoft: buying a dedicated enterprise coding product rather than relying on Copilot suggests OpenAI is putting distance between itself and the Microsoft channel.
DeepSeek
Brown treats DeepSeek as an engineering story, not a geopolitical one. The fact that DeepSeek serves all of China on roughly 2,000 GPUs, while Anthropic still shows capacity warnings to paid users, illustrates how much inference efficiency headroom the major US labs left on the table. Sam Altman recruiting from high-frequency trading firms — optimizers by training — reads to Brown as OpenAI acknowledging it needs to close that gap. Porting specific techniques like FP8 quantization or mixture-of-experts optimizations back to other architectures is harder than it looked in week two of the DeepSeek moment, because inference optimization is highly model-specific and reliability at scale adds its own complexity.
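FP8 quantization, one of the techniques mentioned above, trades mantissa precision for memory and bandwidth. A rough NumPy simulation of an e4m3-style round-trip gives a feel for the trade: per-tensor absmax scaling into the format's dynamic range, then rounding the mantissa to 3 explicit bits. This is a simplified sketch under those assumptions, not DeepSeek's implementation; subnormals and NaN encoding are ignored.

```python
import numpy as np

def fake_quant_fp8_e4m3(x, max_val=448.0):
    """Simulate an FP8 e4m3 quantize/dequantize round-trip in float32.

    Simplified sketch: scale the tensor so its absmax lands at the
    e4m3 maximum (448), round the mantissa to 3 explicit bits, then
    rescale. Real kernels also handle subnormals, NaN, and per-block
    scales; those details are omitted here.
    """
    x = np.asarray(x, dtype=np.float32)
    scale = max_val / max(float(np.abs(x).max()), 1e-12)
    y = x * scale
    m, e = np.frexp(y)           # y = m * 2**e, with |m| in [0.5, 1)
    m = np.round(m * 16) / 16    # keep 3 explicit mantissa bits
    y = np.ldexp(m, e)           # reassemble the rounded value
    return y / scale             # dequantize back

np.random.seed(0)
w = np.random.randn(4, 4).astype(np.float32)
wq = fake_quant_fp8_e4m3(w)
# Max error relative to the tensor's largest weight
err = float(np.abs(w - wq).max() / np.abs(w).max())
```

The per-element relative error of 3-bit mantissa rounding is bounded by about 1/32, which is why weight matrices tolerate it far better than intuition suggests; the engineering difficulty Brown points to lies in keeping that guarantee inside fused, model-specific inference kernels at scale.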