Anthropic releases Claude Opus 4.7 with 3x vision resolution and hints at unreleased Mythos model on the card
Key Points
- Anthropic released Claude Opus 4.7 with tripled vision resolution, a new reasoning tier, and a 64.3% SWE-bench Pro coding score.
- Anthropic benchmarked an unreleased Mythos model in Opus 4.7's model card, signaling internal roadmap velocity ahead of public release.
- Published benchmarks are losing credibility as models produce inconsistent answers on logic puzzles, pushing enterprise teams toward live deployment testing instead.
Summary
Anthropic's Opus 4.7 Advances Vision, But Benchmarks Lose Signal
Anthropic released Claude Opus 4.7 with three concrete improvements: tripled vision resolution; a new "x high effort" reasoning tier that sits between the standard high and max settings, giving finer control over the speed-versus-compute tradeoff on difficult problems; and a 64.3% score on SWE-bench Pro, a standard benchmark for coding capability.
The more strategic signal is buried in the model card itself. Anthropic benchmarked an unreleased model called Mythos against Opus 4.7, publishing performance numbers for a system not yet available to the public, or even to most enterprise partners. The move signals roadmap velocity and suggests Anthropic is building ahead of its release cycles, but it also highlights a narrowing window in which published benchmarks still carry weight.
That benchmark signal is degrading fast. Testing Opus 4.7 on riddles, including one describing a "tiny man" given to a child at birth, surfaced contradictory answers that underscore how fragile these numerical comparisons have become. The model appeared to conflate multiple interpretations (a bar of soap versus a cigar) in ways that don't reduce to a clean score. Across vendors, quirky probes have become as unreliable as asking models to count the letter R in "strawberry" or to generate humor on demand; every model diverges in its own strange way.
The practical implication is that benchmarks have ceded authority to live deployment. Enterprise teams now vet models through rollout pilots: swapping implementations, measuring real KPIs such as cost reduction or sales lift, and running them in production for weeks to see whether tangible results materialize. That's messier than a published number, but it's where the signal actually lives.
TBPN Digest delivers summaries of the latest fundraises, interviews and tech news from TBPN, every weekday.