Interview

Rivet publishes TaxBench: top AI models pass fewer than 31% of multi-step tax questions reliably

May 4, 2026 with Nick Abouzeid

Key Points

  • Rivet Tax's TaxBench benchmark finds leading AI models, including Anthropic's Claude and xAI's Grok, pass fewer than 31% of multi-step tax questions reliably when asked repeatedly, revealing consistency failures that undermine production deployment.
  • Model version upgrades introduce regressions: Claude 4.7 underperforms 4.6, GPT-5.4 Pro beats 5.5, and Grok 4.1 outscores 4.2, forcing businesses to benchmark each release independently.
  • AI models fail catastrophically on data retrieval tasks, hallucinating figures and misreading documents where a junior accountant achieves 100% accuracy, making human oversight mandatory to catch errors the IRS will eventually flag.

Summary

TaxBench

Rivet Tax, an AI-enabled accounting firm handling tax returns and advisory work for thousands of companies, has published TaxBench, an internal benchmark it developed to evaluate whether AI models can reliably perform multi-step tax preparation work. The results are not flattering for the frontier labs.

The core finding: top models, including Anthropic's Claude and xAI's Grok, pass fewer than 31% of multi-step tax questions reliably when asked the same question five times in a row. Many models perform well on a single pass but fall apart under repeated sampling. Reliability at scale, not single-shot accuracy, is what breaks them.
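The "five times in a row" protocol is easy to reproduce in miniature. Below is a minimal sketch, not Rivet's actual harness: `flaky_model` is a hypothetical stand-in that answers correctly 90% of the time per attempt, and a question counts as passed only when all k samples are correct. Per-attempt accuracy of 0.9 compounds to roughly 0.9^5 ≈ 0.59 under the strict criterion, which is why single-pass scores overstate reliability.

```python
import random

def reliable_pass_rate(questions, ask, k=5):
    """Fraction of questions answered correctly on all k independent attempts.

    One miss out of k and the question fails, mirroring the
    ask-the-same-question-five-times protocol described above.
    """
    passed = 0
    for question, expected in questions:
        if all(ask(question) == expected for _ in range(k)):
            passed += 1
    return passed / len(questions)

random.seed(0)

def flaky_model(question):
    # Toy stand-in: correct 90% of the time per attempt, regardless of input.
    return "right" if random.random() < 0.9 else "wrong"

qs = [(f"q{i}", "right") for i in range(200)]
single_shot = sum(flaky_model(q) == a for q, a in qs) / len(qs)
strict = reliable_pass_rate(qs, flaky_model, k=5)
# single_shot lands near 0.9; strict lands near 0.9**5, far below it.
```

The gap between `single_shot` and `strict` is the consistency failure TaxBench is designed to surface.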

Version regressions are a real problem. Claude 4.7 underperforms 4.6. GPT-5.4 Pro beats GPT-5.5. Grok 4.1 outscores Grok 4.2. Newer is not reliably better, which means businesses deploying these models in production workflows need to benchmark each release independently rather than assuming an upgrade is safe.

"The top models don't get those questions right if you ask the same question five times in a row... Four seven performs worse than four six from Anthropic. 5.4 Pro performs better than 5.5. The 4.1 model from Grok beat their 4.2 model — so it's not always newest is best... We get a ton of leverage as long as they are paired with a human who really knows what they're doing." (Nick Abouzeid)

The lowest scores on TaxBench come from data retrieval tasks, where models are given hundreds of pages of PDFs, Slack messages, and emails and asked to extract specific figures, such as carry-forward capital losses from a 2023 tax return. A junior CPA, Abouzeid says, would find that figure correctly 100 times out of 100. The models make numbers up. They fail to OCR photographed documents correctly, skip intermediate steps, and pull the wrong figures without flagging the error.

The failure mode on reasoning questions is a specific one. A model checking whether a client's stock qualifies for QSBS might correctly verify the five-year holding period and the gross asset limit at acquisition, then surface a news article about New York reconsidering QSBS treatment and conclude the stock doesn't qualify. It can't distinguish a live legislative proposal from settled law.

Human oversight remains structural. Rivet's position is that AI delivers significant leverage only when paired with an accountant who knows what to check. The problem isn't just that errors occur; it's that a non-accountant client has no way to catch them. A wrong carry-forward figure or a missed $3,000 capital loss allowance won't look like a hallucination. It'll look like a tax return, until the IRS letters arrive.

None of the major labs have reached out about TaxBench. Abouzeid says he's not surprised: researcher priorities and production business priorities are different things. Rivet intends to pick the best-performing model for its workflows regardless.

The benchmark includes an open submission entry for labs wanting to run stealth or unreleased models against the test. Thomson Reuters' CoCounsel, built on top of the Thomson Reuters tax library, is the most impressive tax-specific tool Abouzeid has seen, though it currently has no API access and runs at hundreds of dollars per seat per month. Rivet expects to build on it when API access opens later this year, and is building its own harness in the meantime.
