Interview

Crosby publishes first AI lawyer negotiation benchmark — models score 44–50%, revealing big gaps in judgment

Jun 17, 2026 with Ryan Daniels

Key Points

  • Crosby publishes first benchmark for AI lawyer performance on multi-turn contract negotiations, with frontier models scoring 44–50%, exposing a critical gap between initial document review and sustained judgment.
  • Models consistently over-accept unfavorable terms to close deals, whereas skilled lawyers protect clients and still keep counterparties engaged, a balance current AI cannot yet strike.
  • Post-training on negotiation reasoning and judgment calls remains largely unexploited on closed models, leaving meaningful upside above current scores. The economics favor AI-assisted work as long as compute spend stays below the equivalent of $1,000 per hour in attorney fees.

Crosby publishes first AI negotiation benchmark — top models score 44–50%

Crosby, an AI-powered law firm that pairs agents with a captive attorney team to handle contract negotiation end-to-end, has published what Ryan Daniels describes as the first benchmark designed to measure how well frontier models handle multi-turn legal negotiations, not just initial document review.

The scores are close together and uniformly modest. GPT-5.5 led at 50.5%, Fable (Claude 4) came in at 47.3%, Gemini 3.5 Flash at 45.1%, and Claude Opus 4.8 at 44.4%. Daniels flags the Fable result as unreliable — Crosby completed only one test run before losing API access — so the 47.3% figure should be treated as provisional.

"This is the first benchmark that just looks at a negotiation between two lawyers as far as we can get to completion... Models are really likely to just say yes and accept some term because you wanna keep the deal going, and that's just not good lawyering. That's where the real judgment came in."

Benchmark construction

The ground truth problem is genuinely hard here. Unlike math or code, legal negotiation has no single correct answer. Crosby's approach, developed with AI data and evaluation firm Micro One, was to have large groups of senior lawyers negotiate the same positions independently and map the overlap in their responses. Early in a negotiation, experienced lawyers converge; as turns accumulate, their strategies diverge. Crosby built a rubric from those patterns and used it to score model behavior across the full negotiation arc.
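As a rough illustration of the overlap idea described above, the sketch below builds a per-turn rubric from how often a panel of lawyers chose each action, then gives a model credit in proportion to that agreement. It is not Crosby's published rubric or tooling: the action labels, the function names `build_rubric` and `score_model`, and the normalization by the most common action are all assumptions made for this example.

```python
from collections import Counter


def build_rubric(lawyer_responses):
    """Map each turn to the share of lawyers who chose each action.

    lawyer_responses: one list per lawyer, holding one action label per turn.
    The labels are hypothetical placeholders, not categories from the benchmark.
    """
    n_turns = len(lawyer_responses[0])
    rubric = []
    for turn in range(n_turns):
        counts = Counter(resp[turn] for resp in lawyer_responses)
        total = sum(counts.values())
        rubric.append({action: n / total for action, n in counts.items()})
    return rubric


def score_model(model_actions, rubric):
    """Average per-turn credit, where matching the most common lawyer action earns 1.0.

    An action no lawyer chose earns 0.0. A minority action earns partial
    credit, which reflects the divergence among lawyers in later turns.
    """
    credits = []
    for action, shares in zip(model_actions, rubric):
        top_share = max(shares.values())
        credits.append(shares.get(action, 0.0) / top_share)
    return sum(credits) / len(credits)


# Four hypothetical lawyers: unanimous on turn 1, diverging afterwards.
lawyers = [
    ["reject", "counter", "hold"],
    ["reject", "counter", "concede"],
    ["reject", "hold", "counter"],
    ["reject", "counter", "hold"],
]
rubric = build_rubric(lawyers)

careful_model = ["reject", "counter", "concede"]
agreeable_model = ["accept", "accept", "accept"]

careful_score = score_model(careful_model, rubric)
agreeable_score = score_model(agreeable_model, rubric)
```

Under this toy scoring, a model that accepts every term earns nothing because no lawyer on the panel did the same, while a model that follows the consensus early and picks a minority move later loses only part of the credit for that turn.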

Where models fail

The sharpest failure mode isn't the first redline, which models actually improved on as the benchmark matured; it's judgment under pressure to close. Models consistently over-accept unfavorable terms to keep the deal moving, which Daniels says is precisely what good lawyers avoid. A skilled negotiator can make a counterparty feel heard while still protecting the client. Current models mostly can't thread that needle.

Harness and post-training

All benchmark tooling is published open source. Daniels says building a harness that edits Word documents reliably has proven unexpectedly difficult, but bespoke post-training on reasoning and stated judgment calls is the more significant lever. In his framing, the closed models are not yet being fully exploited, which leaves meaningful headroom above the current 44–50% band.

Token economics

At legal billing rates, token costs are a secondary concern. Kirkland recently allocated $500 million to internal AI development, and Daniels notes that as long as compute spend stays below the equivalent of $1,000 per hour in attorney fees — which he says is hard to reach even with heavy usage — the economics still favor AI-assisted work over pure headcount. Crosby is spending aggressively, including a substantial early-access run with Fable, on the thesis that models remain cheaper than senior lawyers for the tasks that don't require attorney judgment.
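To make the $1,000-per-hour threshold concrete, the sketch below converts an hourly fee into the number of tokens the same money would buy. The token price used here is a placeholder chosen for easy arithmetic, not a figure from the interview or from any provider's price list, and the function name `hourly_token_budget` is hypothetical.

```python
def hourly_token_budget(fee_per_hour_usd, price_per_million_tokens_usd):
    """Tokens that can be bought per hour for the same cost as one attorney hour."""
    return fee_per_hour_usd / price_per_million_tokens_usd * 1_000_000


ATTORNEY_FEE_PER_HOUR = 1000.0    # threshold cited by Daniels
ASSUMED_PRICE_PER_MILLION = 20.0  # placeholder price, an assumption for this example

budget = hourly_token_budget(ATTORNEY_FEE_PER_HOUR, ASSUMED_PRICE_PER_MILLION)
print(f"{budget:,.0f} tokens per hour")  # prints "50,000,000 tokens per hour"
```

At the assumed price, a workflow would have to consume 50 million tokens every hour before compute cost matched the attorney fee, which is one way to read the claim that the threshold is hard to reach even with heavy usage.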
