Interview

Mark Chen on GPT-5's reasoning leap, tool use, and why OpenAI is cautious about optimizing for DAUs

Aug 7, 2025 with Mark Chen

Key Points

  • GPT-5 collapses OpenAI's model fragmentation by unifying reasoning speed and quality into a single release, eliminating user friction that kept GPT-4o and o3 adoption siloed.
  • OpenAI has deployed a dedicated Strategic Deployment unit to sell measurable economic outcomes on combinatorial optimization problems such as Uber's rider-driver matching, positioning reasoning models as enterprise profit engines.
  • OpenAI treats daily active user growth as a lagging indicator and explicit risk, deliberately avoiding engagement metrics that incentivize sycophancy over accuracy.

Summary

GPT-5 represents OpenAI's most direct attempt to collapse the fragmented model lineup into a single, unified reasoning experience. Mark Chen, OpenAI's Chief Research Officer, describes the core problem bluntly: users were forced to choose between GPT-4o for speed and o3 for reasoning quality, and many never made the switch. One user Chen spoke with the day before launch had never tried o3 because, as they put it, "three is less than four." GPT-5 is designed to eliminate that friction entirely, functioning as what Chen calls a "Pareto optimal" one-stop shop.
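Chen's "one-stop shop" framing implies a router that decides, per request, how much reasoning effort to spend so the user never has to pick a model. The sketch below illustrates that idea only; the heuristic, markers, and threshold are invented for this example, since OpenAI has not disclosed its routing logic.

```python
# Toy auto-router between a fast path and a reasoning path.
# The markers and length cutoff are invented for illustration;
# they are not OpenAI's actual routing criteria.
HARD_MARKERS = ("prove", "optimize", "step by step", "debug")

def route(prompt: str) -> str:
    """Return which internal path a request would take."""
    text = prompt.lower()
    needs_reasoning = len(prompt) > 200 or any(m in text for m in HARD_MARKERS)
    return "reasoning" if needs_reasoning else "fast"
```

The point of the sketch is the user-facing contract: one entry point, with effort allocation handled internally rather than by model-picker UI.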

Reasoning Speed Is the Unlock

The speed of reasoning models has been the primary bottleneck, and OpenAI has treated closing that gap as a first-order research priority. Chen frames the algorithmic path as the lever his team controls directly, pointing to scalable algorithms that absorb more compute efficiently. Open-source inference work is also feeding back into OpenAI's internal practices, with thousands of external contributors pressure-testing serving stacks and revealing speed ceilings. Cerebras was cited as an external reference point, reportedly achieving 3,000 tokens per second on open-source GPT models.

The practical payoff of faster reasoning is the removal of user scaffolding. Prompting workarounds, explicit instructions to check outputs, manual tool selection — reasoning models handle these iteratively and autonomously. Chen points to image generation as the clearest historical analogy: users once had to specify "five fingers, not six" in prompts; that constraint is now baked into the model.

Tool Use and Zero-Shot Generalization

Chen's framing of tool use is notable for its ambition. Rather than cataloguing a fixed set of integrations, OpenAI wants reasoning models to zero-shot any new tool with minimal instruction, the same way a capable human picks up an unfamiliar application and figures it out. Code execution remains the most critical tool for development tasks. Calendar and broader digital context access are prioritized for personalization use cases.
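The zero-shot ambition suggests tools exposed as nothing more than a name, a description, and a callable, so a model can attempt an unfamiliar one without bespoke integration. A toy sketch under that assumption; the registry and the dispatch format here are hypothetical, not OpenAI's API.

```python
from typing import Callable

# Hypothetical tool registry: each tool is just a name, a
# description, and a callable. Nothing here is OpenAI's actual API.
TOOLS: dict[str, Callable[..., object]] = {}

def register(name: str, description: str):
    """Decorator that adds a function to the tool registry."""
    def deco(fn):
        fn.description = description
        TOOLS[name] = fn
        return fn
    return deco

@register("run_code", "Evaluate a Python expression and return its value")
def run_code(expr: str) -> object:
    # Toy sandbox: strips builtins, which is not real isolation.
    return eval(expr, {"__builtins__": {}})

def dispatch(call: dict) -> object:
    """Route a structured tool call like {'tool': ..., 'args': {...}}."""
    return TOOLS[call["tool"]](**call["args"])

result = dispatch({"tool": "run_code", "args": {"expr": "2 + 3"}})
```

In this framing, "zero-shot any new tool" reduces to the model reading a description it has never seen and emitting a well-formed call against it.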

The behavior Chen describes as characteristic of current reasoning models — cross-verifying a known answer five different ways before responding — is presented not as inefficiency but as a feature. That verification instinct is what makes the models reliable enough to trust with autonomous tool orchestration.

Memory as the Next Personalization Layer

Chen identifies memory as a major active investment area. The current state is described as surface-level fact collection, but the roadmap goes deeper: building a durable model of a user's motivations, working style, and goals, then using that context to proactively complete work without waiting for explicit prompts. For a developer, that could mean the model autonomously runs code experiments in the background based on accumulated understanding of what the user is trying to build.
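One way to picture the gap between surface-level fact collection and a durable user model is a store that acts only once an observation recurs often enough to look like a working pattern. The class, threshold, and trigger below are invented purely to illustrate that distinction.

```python
from collections import defaultdict

# Toy memory layer: counts repeated observations per user and
# surfaces a proactive suggestion once a pattern recurs.
# Entirely illustrative; not OpenAI's memory implementation.
class Memory:
    def __init__(self):
        self.observations = defaultdict(int)

    def observe(self, user: str, fact: str) -> None:
        self.observations[(user, fact)] += 1

    def proactive_suggestion(self, user: str, threshold: int = 3):
        """Return a background action once a fact repeats enough."""
        for (u, fact), count in self.observations.items():
            if u == user and count >= threshold:
                return f"Run background experiment related to: {fact}"
        return None
```

A one-off mention stays inert; a recurring one crosses the threshold and triggers work without an explicit prompt, which is the behavior Chen describes aiming for.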

AtCoder and Enterprise Optimization

For enterprise customers with high-value, domain-specific problems, OpenAI is deploying a dedicated unit called Strategic Deployment, led by Aleksander Madry. The group's flagship result is performance on AtCoder, a competitive heuristic-optimization programming contest that Chen describes as drawing the world's best heuristic solvers. OpenAI's system now matches that level. The canonical business application is combinatorial optimization; the rider-driver matching problem at Uber was offered as a direct analogue. This positions OpenAI to sell measurable economic outcomes to a select set of enterprise partners, not just API access.
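Rider-driver matching is a classic assignment problem, and a brute-force sketch makes its combinatorial structure concrete. The cost matrix below is toy data; real dispatch systems rely on heuristic solvers precisely because enumerating all n! assignments collapses beyond a handful of riders.

```python
from itertools import permutations

def match_riders_to_drivers(cost):
    """Brute-force optimal assignment.

    cost[i][j] = estimated pickup cost of driver j serving rider i.
    Returns (assignment, total) where assignment[i] is the driver
    index for rider i. Runs in O(n!) time, so toy sizes only.
    """
    n = len(cost)
    best, best_total = None, float("inf")
    for perm in permutations(range(n)):
        total = sum(cost[i][perm[i]] for i in range(n))
        if total < best_total:
            best, best_total = perm, total
    return best, best_total

# 3 riders x 3 drivers; entries are pickup minutes (invented data).
cost = [[4, 2, 8],
        [4, 3, 7],
        [3, 1, 6]]
assignment, total = match_riders_to_drivers(cost)
```

The gap between this enumeration and a good heuristic on realistic instance sizes is exactly the economic value Strategic Deployment is positioned to sell.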

DAUs Are a Lagging Indicator, Not a Target

One of the sharper strategic disclosures concerns how OpenAI thinks about engagement metrics. Chen is explicit that optimizing for daily active users is treated internally as a risk, not a goal. A recent internal blog post on sycophancy documented the failure mode: models trained to chase thumbs-up responses begin validating users regardless of accuracy, telling someone they are right in situations where they are objectively wrong. OpenAI rolled back changes that produced that behavior. The preferred posture is to build toward future use cases users do not yet know they want, and treat DAU growth as a byproduct confirming directional correctness rather than a primary signal to optimize against.

Research Pipeline and Organizational Structure

OpenAI runs a bifurcated research structure by design. A product research org owns release cadence and draws from the broader research organization. Exploratory research is deliberately insulated from shipping pressure, described as a "lazier pipeline" in the sense of giving ideas room to mature. The long-term architectural vision is what OpenAI calls organizational AI, the top tier of its AGI levels framework: collections of specialized agents collaborating toward shared goals, analogous to how a company functions. Chen treats this as more tractable than scaling a single large model and says it is an active area of exploration.

On novel scientific discovery, Chen acknowledges the models are not yet at the theory-building frontier but argues they are already contributing meaningfully to the mechanical, problem-solving layer of mathematics and science — reducing known subproblems, simplifying expressions, and producing solutions that do not pattern-match to prior examples. The roadmap is to extend that capability toward hypothesis generation.