Crosby publishes first AI lawyer negotiation benchmark — models score 44–50%, revealing big gaps in judgment

Jun 17, 2026 · Full transcript · This transcript is auto-generated and may contain errors.

Featuring Ryan Daniels

Speaker 1: NPS. Just go do it.

Speaker 2: NPS was legit, yeah. Through the roof.

Speaker 1: When you're raising that Series D or F, why not just

Speaker 2: Just do it at

Speaker 1: the Head over

Speaker 2: New York Stock Exchange.

Speaker 1: The New York Stock Exchange.

Speaker 2: We got Ryan Daniels from Crosby, the cofounder and CEO, in the waiting room. Let's bring him into the TBPN UltraDome. Ryan, how you doing? Good to see you again. What's the latest? Good to see you, guys. Welcome back. You're bench maxing. You're bench maxing.

Speaker 1: Bench maxing.

Speaker 10: Always gotta benchmark.

Speaker 2: Tell us about benchmarks. What are you launching? Why is it important? Break it down for us.

Speaker 10: So today, we published a benchmark. It's the first benchmark that we know of that just looks at, and it's simple in theory, a negotiation between two lawyers, like, as far as we can get to completion.

Speaker 2: Yeah.

Speaker 10: And I think the main takeaway is there's a lot of art to negotiating, and the way that lawyers negotiate contracts is really complex because there's so many moving variables.

Speaker 2: Mhmm.

Speaker 10: And this is what we do at Crosby. We try to get those closed faster. And so for the first time, we can actually measure, you know, how do agents do on the first redline? That's fine. We kind of know that already. But how do they go over and over through the process, where the judgment of a lawyer really takes hold and they just try to figure out who their counterparty is and how they can get them to yes?

Speaker 2: Next step, you gotta train a voice model to be able to yell on the phone at the other client, at the counterparty. Let them know that you're not doing another turn of the forms. The battle of the forms is over. Now, obviously, there's a lot that goes into this. Talk about the data. Are you getting good redline data, ground truth data, from firms? Are they happy to partner with you on this? Like, what is the value prop to actually get the correct data? Are you doing it yourself through a data firm? Like, what's the strategy to get the ground truth here? Because it's not the same as, like, a math problem that you can just verify with Lean.

Speaker 10: Yeah, that's the key point of it, and you're totally right. So we worked with micro1 really closely on this, with Ali and his team. They've been amazing. They have a really great legal focus that they're building out. And, you know, we have our own captive law firm here. That's been our shtick from the beginning: in order to get the full benefits of what agents can do in law, start your own law firm and sell end-to-end legal work. And as agents get better, our margins get better. So we worked with Ali, and the hard thing is, there is no right answer. We basically had large groups of real lawyers negotiating, all with the same positions for the same parties, and looked at how many of those lawyers agree with one another on the right way to respond to the other lawyers. And so we could just see the overlap. As the negotiation went longer and longer, we saw a ton of deviation between those lawyers, which is fascinating. They're all very senior. They're all very experienced. This is a real negotiation. But at the beginning, the lawyers more or less did the same thing. We then built a rubric based on how those lawyers actually worked and tried to create some sort of ground truth for the best way to negotiate. But, to your point, I don't know. Maybe if we just did everything in all caps, the counterparty would just accept it, and we could yell through the margin. So you have to try it.
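The overlap measurement described in this answer can be sketched roughly as follows. This is only an illustration, not Crosby's or micro1's actual method: the function name `pairwise_agreement`, the response labels ("accept", "counter", "reject"), and the sample data are all hypothetical. The idea is simply to count, for each negotiation round, what fraction of lawyer pairs responded the same way to the same clause.

```python
from itertools import combinations

def pairwise_agreement(responses):
    """Fraction of lawyer pairs that gave the same response to one clause."""
    pairs = list(combinations(responses, 2))
    if not pairs:
        # Zero or one lawyer: there is no one to disagree with.
        return 1.0
    return sum(a == b for a, b in pairs) / len(pairs)

# Hypothetical data: round number -> how four lawyers responded to one clause.
rounds = {
    1: ["counter", "counter", "counter", "counter"],
    2: ["counter", "counter", "accept", "counter"],
    3: ["accept", "counter", "reject", "counter"],
}

agreement_by_round = {r: pairwise_agreement(v) for r, v in rounds.items()}

for r, score in sorted(agreement_by_round.items()):
    print(f"round {r}: agreement {score:.2f}")
```

With this made-up data, agreement falls as the rounds go on, which mirrors the pattern described: lawyers mostly agree early and deviate later.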

Speaker 2: That's why you pay the big bucks, if you've got big lungs and can scream. It seems like all the models did very similarly. 50.5 for GPT-5.5, weird numerology there: 5.5 gets a 50.5. Gemini 3.5 Flash got a 45.1, Opus 4.8 got 44.4, and Fable 5 got 47.3. That seems surprising. I mean, it seems like the window for testing was low. Or is it something like Fable is more tuned to cybersecurity, bio, those sorts of reinforcement learning feedback loops, and maybe it hasn't been applied to legal yet? But what was your interpretation?

Speaker 10: We put a big asterisk on Fable because we got one testing run through and then we lost the access. So we still would wanna do a few more, if and when

Speaker 2: Yeah.

Speaker 10: we get access back. You know, I think it's not surprising. I think what was most interesting to us were a few things. One, in the first redline, like, the first review of the document, agents really did poorly. They scored poorly there because they couldn't surface those first issues. As the negotiation went on and we knew the issues, models did quite a bit better. But the biggest thing was models are really likely to just say yes and accept some term because you wanna keep the deal going, and that was part of the instruction. And that's just not good lawyering. Right? Like, a great lawyer knows how to kinda make you feel like you're getting your side, but still push the deal forward and protect the client. And that's where the real judgment came in. And so, you know, in our mind, we have a lot of work to do because we want our agents to be able to take the work off our lawyers' hands, to be able to think through something and say, hey, we pushed back on this for this reason, and really get the counterparty there.

Speaker 4: Mhmm.

Speaker 10: So, you know, I just think the models did better than we thought they would, but there's a lot of work to do.

Speaker 2: We're having a little bit of interference. I have one more question, Jordan. Go for need to get in? Just wanna know about the value of the harness, because it seems like all of these models are sort of neck and neck, you know, within a few percentage points. What happens if you wrap them in a very bespoke harness that your team builds? Can you get to 70%, 90%? Like, where do you think the upper bound is on this benchmark?

Speaker 10: Yeah, that's a great question. And, you know, everything's published. It's all open source. So the harness that we use, everything, can all be replicated. There's a lot of ways to push. I think, you know, we've invested a lot in the harness that we wrote specifically, that's really good at editing Word documents. That's turned out to be a really hard thing.

Speaker 2: Yep.

Speaker 10: And so there's ways to push there. You know, I think ultimately, more and more bespoke models that are really post-trained on just making judgment calls and stating their reasoning will be an area. So both of those things help. But at the end of the day, there's a ton of intelligence in just the closed models, and I think it's not being fully exploited. So, yeah, all areas to explore. Those are the right questions.

Speaker 1: How has your team experimented with token maxing? Like, you guys are, you know, heavily funded. You're building an AI law firm. I imagine you're not telling the team, like, you know, be conservative with tokens, especially because you can just learn faster, and it's kind of the whole point of the business. But where does token maxing make sense with legal? Have you already found an equilibrium? What is your overall kind of, like, ethos around it?

Speaker 10: I feel like as a company, we're not, like, successful enough to be worried about token cost yet. We're still in a place where we can just be putting in as much money as we can. And I think more and more law firms, like Kirkland just allocated half a billion dollars to building their own AI stuff, are just trying to figure it out. You know, they're charging huge amounts of money for the labor spend of legal, so they have a lot of room to throw money at tokens. And so long as you're basically spending less than the equivalent of a thousand dollars an hour on tokens, which even for this is so hard to do, you're still coming out with a good margin. So, you know, we're spending a ton. We did a lot of work with Fable. We got early access to it, and we spent a lot of money with it. And so I think there's still room to go, because ultimately it's either the model or the human, and the models are still cheaper than, you know, hiring expensive lawyers for some of the tasks that we don't think we need lawyers for.
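The margin argument in this answer comes down to simple arithmetic, sketched below. This is a rough illustration under stated assumptions, not a figure from Crosby: it treats the thousand dollars an hour as the billed value of the equivalent lawyer time, and the function name `gross_margin` and the $200 token spend are hypothetical.

```python
def gross_margin(billed_per_hour, token_spend_per_hour):
    """Share of billed revenue left after paying for tokens (1.0 = 100%)."""
    if billed_per_hour <= 0:
        raise ValueError("billed_per_hour must be positive")
    return (billed_per_hour - token_spend_per_hour) / billed_per_hour

# Assumption: $1,000/hour is the billed value of the work being replaced.
BILLED_PER_HOUR = 1_000.0

# Hypothetical token spend of $200 per hour of equivalent work.
example_margin = gross_margin(BILLED_PER_HOUR, 200.0)

# Spending the full $1,000/hour on tokens is the break-even point.
break_even_margin = gross_margin(BILLED_PER_HOUR, 1_000.0)

print(f"margin at $200/hour of tokens: {example_margin:.0%}")
print(f"margin at $1,000/hour of tokens: {break_even_margin:.0%}")
```

Under these assumptions, any token spend below the billed rate leaves a positive margin, which is the point being made about having room to spend.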

Speaker 2: Yeah. That makes sense. Well, thank you so much for taking the time to come.
