LM Arena raises $150M at $1.7B valuation to become the neutral benchmark for AI model evaluation
Jan 12, 2026 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Anastasios Angelopoulos
guest, I'll also tell you about Lambda. Lambda is the superintelligence cloud building AI supercomputers for training and inference that scale from one GPU to hundreds of thousands. So up next, we're going over to LM Arena.
Anastasios, welcome to the stream. How are you doing?
Hey, brother. How's it going? Nice to see you. [music]
It's going fantastically. It's going extremely well for you. You raised some money recently. Let's kick it off with a big gong smash. How much did you raise?
150 million.
I have all the good stuff. Fantastic.
Great stuff. Great stuff. Okay. So, our friend was just joking around, this account Near, they were saying LM Arena raised at a $1.7 billion valuation, and they were using the Michael Burry meme.
Very rude. Very rude. Very rude. Very rude. But they were just joking. But it's okay because you're here to tell us why why that's a deal of a lifetime.
Yeah. Explain.
Listen, evaluation's a big problem. Evaluation is a big problem. People don't know how to solve it. You have all these different AIs. The space is becoming more complicated. You have different modalities, different models, all these people competing in coding, competing in, you know, software engineering, competing in general chat. You have models from China, models from the US. Who are you going to pick? Let's say you're a consumer, you're a developer.
Even a business, you're going to make a decision, and you're going to spend millions and millions of dollars on that. You've got to make sure you make the right decision.
Yeah. Procurement is a big market. And you know, all these problems are huge. I mean, the ultimate problem is how do you help people find the best solution for them. And as a neutral platform, of course, it needs to be separate from the labs. It needs to be a third party that does this. We're positioned well to attack it.
So I imagine, I mean, independent. We were joking around that if you ever wanted to return the capital, you could get one of the labs to pay you $10 billion just to put them permanently hardcoded at the top of the rankings. You'd be out of business in 12 months, but you'd walk away with a pretty penny.
Well, [laughter] that's the thing.
I don't think that's what's happening. How do you make money?
You'd destroy the value if you do that.
Yeah. Yeah. Exactly. Exactly. So, how how do you make money then?
Well, what we do is we help labs and enterprises get analytics on how well their models are doing. Help them understand the strengths and weaknesses of their models in different domains like math, coding, instruction following, multi-turn. Think about it like a full body scan of your model.
And what distinguishes us from the standard benchmark is that it's all live. So you can't overfit it because there's constantly new data coming in.
Oh, interesting.
Um, and it's on real users. So we have tens of millions of real users that are coming to LM Arena, using AI for this huge diversity of different tasks. And that's, of course, the data that's used as the basis for these analytics.
Yep. Um, talk to me about the proliferation of different LM Arena categories. I imagine there's an exponential curve there as different models have different capabilities. We're not just testing math and science reasoning and chatting ability, there are so many of these. So, how fast is it growing? How are you setting up your system to handle potentially tens of thousands of different benchmarks and models? How is that growing and scaling?
That is a great question, and I think that we're honestly still developing what that big roadmap looks like.
But that being said, here's where we're starting. We have a bunch of different categories. I mentioned a few earlier, like math, coding, instruction following, but we've recently also dug in a little bit more to look at different types of users, basically almost like user research. Expert users: what are experts saying about all these different models? We have a leaderboard for that. Occupational categories: law, medicine,
you know, business. Which models are best for these different categories of usage, like marketing? We have that information on LM Arena, so it's a one-stop shop for you to understand the industry-level, economically valuable performance of these models. But if you really think about it, the long-term vision of this sort of thing, it's almost like every individual should have their own evaluation.
Which model's best for you. You know, the technology brothers,
technology [laughter] brothers, eval.
Yeah. [clears throat]
It needs to have its own kind of thing, because you're doing your own tasks and they might be very different from what I'm doing. And so eventually, the future you can imagine of building a technology like this, and we're building the infrastructure to do this, is to have evaluations for every individual.

How do you incentivize people to participate on the voting side, the ranking side? That human element is really important in terms of grading different LLMs. What's the current incentive structure? How is that evolving? What are the challenges that come up with running a multi-sided marketplace like this?
Very good question. Incentives are really, really important. If you think about it, an incentive is almost like a little hack into the human brain that tells you how to move, how to operate.
Yeah.
Right. So that's why games are so powerful. Games can get you into an incentive system that's just like, hey, I'm playing Flappy Bird or I'm doing my Candy Crush and I'm just swiping, swiping, swiping.
We do not really incentivize users on LM Arena to vote. And that's part of the power of the platform. Yeah.
Unlike, let's say, if we were to pay people to vote. We don't do that.
The reason we don't do that is because we want them only coming to do their real job. They come because they get value out of the platform, and they vote only because they want to, because they're intrinsically motivated, because they've got something to say.
You can imagine that if you incorporate incentives, it might hurt the leaderboard. So we're very careful about the way that we go about that. However, we are exploring how we can design the incentives in a sort of incentive-compatible way.
Yep.
In order to preserve the integrity of the leaderboard while also rewarding people to vote. That's on our minds, but not yet done.

How confident are you that you can encode big model smell, you know, the taste, the vibes, into a ranking?
Well, if you look at the leaderboard, it's there. It's there for you to see.
Okay.
Yeah. I mean, it does reflect it
and it goes to show how powerful human preference can actually be that people are seeing things that you wouldn't necessarily anticipate. Mhm.
And that's why, when model developers develop their model, the big problem is: when I put it out into the real world, in the wild, what's the performance going to be like?
Mhm.
I don't know unless I see it. And that's because people react in these, you know, strange ways when they see a model for the first time. That's part of the power of this platform: it puts the model in front of those people, and then we give analytics to try to understand the different usage patterns, which individuals are voting for what. So you can encode things like big model smell. It's not uncommon that model providers will find out through us that their model's really great at math or really great at coding, and they didn't even actually know.
What does it take to get a model on LM Arena? Is it something where the lab is giving you preview access? Is there an application form? Do you already have relationships with these folks?
So, we have a contact form on our website that you can use if you want to, you know, submit a model. Of course, we have capacity constraints, but we try to be judicious and let everybody participate in the battle.
Someone was asking about DeepSeek,
to get on the leaderboard.
DeepSeek R4. Well, what does it take to get a model like that on the platform?
We'd love to get DeepSeek R4 on the platform. I don't know, maybe it's already in progress.
Okay. Yeah. Um, how much does hardware matter here? We were just talking to Andrew from Cerebras about the speed that comes from that. There's an element where plenty of people would rather use a dumber model if it's 10 times as fast, or vice versa, right? And so you have to create an apples-to-apples situation most likely, or at least pre-cache the results so that you're not noticing speed. But do you think that speed tradeoffs will become a bigger piece of what you do?

Yeah, there's no question about it. Really, the ultimate question that people have is, how do I,
for myself or my organization or whatever
choose along the Pareto frontier of speed, performance, and cost.
Yeah.
Right. Those are the three things. And usually it's performance and speed number one and number two.
Yeah.
Because everybody's spending so much money right now.
Yeah.
That it's like, you know, ring the gong again. So much money. [laughter]
Why not?
There you go. Go to gong. Money being spent, baby. Now the issue with that is
That's gong twice. Let's go.

The speed, if you don't equalize it, you can get bias in the ratings. Yeah.
Right. So we want to be able to disentangle speed versus performance, and that requires us equalizing. Sure.
But if you go to direct chat mode, you can use the model and see it just as it is. And we're planning on expanding the leaderboard to get more of the speed in there, the latency, as well as cost, so that people can make all those trade-offs right on our platform.
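The speed, performance, and cost trade-off described here can be sketched as a simple Pareto-dominance filter: keep only the models no other model beats on every axis at once. A minimal illustration, where the model names and numbers are entirely hypothetical, not real benchmark data:

```python
def dominates(a, b):
    """True if model a is at least as good as b on every axis
    (higher quality, lower latency, lower cost) and strictly
    better on at least one."""
    q_a, lat_a, cost_a = a
    q_b, lat_b, cost_b = b
    at_least_as_good = q_a >= q_b and lat_a <= lat_b and cost_a <= cost_b
    strictly_better = q_a > q_b or lat_a < lat_b or cost_a < cost_b
    return at_least_as_good and strictly_better

def pareto_frontier(models):
    """Keep every model that no other model dominates."""
    return {name: m for name, m in models.items()
            if not any(dominates(o, m) for other, o in models.items()
                       if other != name)}

# (quality score, latency in seconds, $ per 1M tokens) -- invented values
models = {
    "big-slow":   (95, 8.0, 15.0),
    "mid":        (88, 2.0, 3.0),
    "fast-cheap": (80, 0.5, 0.4),
    "dominated":  (79, 2.5, 3.5),  # worse than "mid" on every axis
}
frontier = pareto_frontier(models)
```

Here "dominated" drops out because "mid" beats it on quality, latency, and cost simultaneously, while the other three all survive: each one wins on at least one axis, which is exactly the "dumber but 10 times faster" point made above.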
Yep. That makes sense. Would you ever do anything at the application layer or does it just get too chaotic because you know UI is a new variable and there's all these other factors?
I mean, LM Arena is an application, no?
Yeah, sure. But I'm talking about, like, you know, two legal AI tools, and trying to compare them. Because if you go down the procurement route, I don't know if that becomes a big
uh
yeah, you could see like a G2 Crowd type of business here as well. Although I have no idea if that makes any sense for you, but
Totally understood. But the thing is that it's hard to do that for our users, right? Because then we need to build 10 different product surfaces, or 100 different product surfaces. And that's one of the exciting things about the ecosystem right now: it's actually the fundamental product surface where things are evolving very quickly, and also a lot of value is being aggregated right there. There's a reason why, you know, the number one revenue stream of OpenAI is consumer,
right it's a consumer application
Um, and while other companies have different strategies, ours simply cannot be to replicate every application.
Yeah.
So the value that we hope to provide, again, is to help people understand the different trade-offs of different models, evaluate them for their use cases, procurement, so on and so forth. And that might mean helping enterprises link together with their feedback, you know, understand their users better, perhaps warm-starting from the large user base that we have, and giving them those analytics and tools to help them make decisions. That's how I can imagine us moving into the application workflow.
Yeah.
Will you ever create a romance benchmark?
Ha, I think it's a good idea.
How quickly models make people fall in love.
Comedy benchmark would be
Let me ask you, are you are you asking that for any specific reason? [laughter]
Got them.
Got me. We talk about comedy, you know, comedy bench. It's something we try to test internally. They usually end up being funny because they're so not funny that it's pretty brutal. It's pretty brutal.
Brutal. Or, if it is good, it's like recycling. Like clearly it went to Reddit, copied the top joke, and then just regurgitated it.
Well, did it ever make you laugh? Has it made you laugh before?
Yes, but accidentally. When it's accidental, it's amazing. It's the best. Yeah.
Where did the idea for this come from? Like, what was the inciting moment to start the company?

Well, so there's the start of the company, and then there's the idea. I've been personally working on LM Arena for almost three years now, along with Wei-Lin and Ion, and there was actually a big group of students at Berkeley. Sure.
This came from an academic project. I was doing my PhD on like theoretical statistics, theoretical machine learning,
you know, kind of abstract. I was in a basement proving theorems. I had no idea that [laughter] I would be
He built a basement of scraps. Scraps.
He built it in a cave.
Seriously, there were rats in the ceiling. No way. You know, not kidding.
I'm glad you have $150 million for a new office. Hopefully, it's nice.
Go Bears, baby. And so, what ended up happening is that I got looped into this project that was happening at Sky Lab, which is, you know, Ion and Joey Gonzalez and Luca and all these people at Sky Lab were working on it.
Um, that's the lab where Databricks came out of, and Anyscale, Ray, and so on.
Yeah. And they were working on, like, how do we evaluate? Early days of ChatGPT, you know. There was model A and model B. One was doing better than the other on the benchmarks, then you would chat with them,
and it's like hey these benchmarks are not reflective of how good it is at talking to me
so how are we going to measure that? This was in the context of one of their early open source models called Vicuna that they had developed,
and so the sort of pairwise preference, hey, chat with the models and see which one you like better, that strategy emerged from there. But it started with Amazon gift cards. It didn't start from organic usage. It started from passing around Amazon gift cards to people at Berkeley,
and then it sort of, you know, popped, and then it went down and kind of sat at 30 users a day for a little while, and then it grew and grew and grew, in large part due to Wei-Lin's sort of consistent effort on Twitter and so on.
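The pairwise "which response do you like better?" votes described here are typically turned into a leaderboard with a rating model. A minimal sketch using Elo-style updates, where the model names and battle outcomes are invented for illustration; LM Arena's production pipeline fits a Bradley-Terry model over all battles rather than running a single online pass like this:

```python
from collections import defaultdict

K = 32  # step size: how much one vote moves the ratings

def expected_score(r_a: float, r_b: float) -> float:
    """Probability that A beats B under the Elo logistic model."""
    return 1.0 / (1.0 + 10 ** ((r_b - r_a) / 400))

def update(ratings, winner, loser):
    """Move points from loser to winner, scaled by how surprising the win was."""
    e_win = expected_score(ratings[winner], ratings[loser])
    ratings[winner] += K * (1 - e_win)
    ratings[loser] -= K * (1 - e_win)

# Every model starts at 1000; each battle is (winner, loser) from one vote.
ratings = defaultdict(lambda: 1000.0)
battles = [("model-a", "model-b"), ("model-a", "model-c"),
           ("model-b", "model-c"), ("model-a", "model-b")]
for winner, loser in battles:
    update(ratings, winner, loser)

leaderboard = sorted(ratings.items(), key=lambda kv: -kv[1])
```

Because each update is zero-sum, the total rating mass is conserved, and an upset win (beating a much higher-rated model) moves the ratings more than an expected one, which is what lets sparse, organic votes still separate the models.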
It's awesome
up until the point where it became this juggernaut,
a true overnight success. Great to see it. And thank you so much for taking the time to come on. Congrats on all the progress.
Hey, great to see you guys. Thanks so much. Have a great day.