LM Arena raises $100M to become the people's voice on AI model quality
Jun 6, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Anastasios Angelopoulos
soon. We have Anastasios from LM Arena. We're going to get to the bottom of what the best LM is. What the best LLM is. Welcome to the stream, Anastasios. How are you doing? Pretty good. How are you? I'm great. Uh, congratulations. Uh, you have news.
Can you first introduce the company, uh, the news and, uh, everything that's going on in your world? Sure. So, LM Arena is an open platform for evaluating AI via human preference. Um, what it is is you can go to the website lmarena.ai and you can type in your prompt, and what you'll see is responses from two anonymous LLMs. Uh, and the LLMs are, you know, some of the best: Gemini from Google and ChatGPT from OpenAI and Claude from Anthropic. And they're sort of randomly sampled from a pool.
And not only do you see the responses, but you can also battle them against one another to see which one wins for you. Um, and then if you choose, you can vote for the one that won. And then we use that data in order to construct leaderboards.
So we have a leaderboard of, you know, the performance overall, but also in subcategories like math, coding, instruction following, creative writing and so on. Yeah. Uh, and we sort of see it as like the people's voice on AI progress, right? Because you can go there, you can express your opinion.
A lot of the tasks AI is doing are subjective. Um, and by doing so you can sort of guide the field. Yeah. So you raised money recently. Is that correct? Yeah, we raised $100 million. Wow. Uh, in a round that was, uh... Yeah. Thank you so much.
Um it just speaks to the strength of our community uh and our duty to continue to improve the product for them. Yeah. So now Yeah. What is the business model? Because I think people might not be so familiar. Uh there's a lot of benchmarks out there but you're actually building a huge business around this. Absolutely.
The business model is to help... you know, one of the biggest problems that we see with AI-driven software today is people just don't know how to build it reliably. They don't know how to pick the right model to use for them.
They don't know how to build like an agentic system with multiple subcomponents in such a way that they're going to be able to ensemble it together into a piece of software that's like reliable and performant. And it's hard even to define what these words mean because a lot of the job of AI is to interact with humans.
Mhm. And so when you're interacting with humans, you know, a lot of the responses people give are subjective. So how do you make sure that you are using your models in a way that like ends up satisfying your users, ends up uh making your product better and better?
You know, we're hoping to be able to address that problem because we have such a big community, millions of people coming to the site and giving their preferences and we have a whole competition landscape of all the different models against each other on all sorts of diverse tasks, not just in tech, but also in, you know, medicine and real estate and school and so on.
Yeah. So talk to me about the death of traditional benchmarking and the rise of big model smell. Have you been able to quantify the unquantifiable yet? Sure. So I wouldn't really describe it as the death of traditional benchmarking.
I would just say that um benchmarking is entering a new age where there are different types of benchmarks that are needed. So what does a traditional benchmark do just zooming out? What you do is you say hey here's like a particular vertical that I want to evaluate. Let's say I want to evaluate image classification.
Well, then I'm going to collect a bunch of images. I'm going to classify them and then I'm going to basically give models a test. I'm going to grade them based on how well they do on like a held out set of classified images. So, what is the problem with this in the current day and age?
Well, these days, the way people are using AI is like so broad that you could never like annotate all of it with data sets. Like you could collect a data set, it's it's going to evaluate that thing, but not everything. And you could collect 20 and that's going to, you know, evaluate 20 things.
But really there's like a million different things that you should be evaluating. And you know, arguably even every individual has their own point of view on, like, you know, all these subjective tasks. And so that's why things like LM Arena come in as sort of a different, orthogonal signal.
Yeah, maybe it's still useful to say how well is my model doing on image classification or multiple choice question answering.
But maybe I also want to know, you know, whether people like this model, why they like this model, whether they're using it in the ways that I intend it to be used, which and which parts of the space is it good and bad.
And our perspective is: those are the questions that we can answer by gathering a massive data set of human preferences and then going backward, basically inverting the problem, and saying, let's mine that data for all the analytics that we can get, so that we can tell you which model's best for you and for your use case. And that's where I think the power is in this sort of new age of benchmarks.
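The "inverting the problem" step he describes, turning raw pairwise votes into per-model scores, is commonly done with a Bradley-Terry model, which Chatbot Arena has publicly described using for its leaderboard. A minimal sketch, assuming a flat list of (winner, loser) votes and fitting scores by plain gradient ascent (the model names, learning rate, and iteration count are illustrative, not LM Arena's actual pipeline):

```python
import math
from collections import defaultdict

def fit_bradley_terry(votes, iters=200, lr=0.1):
    """Fit Bradley-Terry strengths from pairwise preference votes.

    votes: list of (winner, loser) model-name pairs.
    Returns a dict mapping model name -> latent strength;
    higher means more often preferred.
    """
    models = {m for pair in votes for m in pair}
    score = {m: 0.0 for m in models}
    for _ in range(iters):
        grad = defaultdict(float)
        for winner, loser in votes:
            # P(winner beats loser) under the current scores
            p = 1.0 / (1.0 + math.exp(score[loser] - score[winner]))
            grad[winner] += 1.0 - p   # push the winner up by the surprise
            grad[loser] -= 1.0 - p    # push the loser down symmetrically
        for m in models:
            score[m] += lr * grad[m]
    # Center the scores so they are comparable across runs
    mean = sum(score.values()) / len(score)
    return {m: s - mean for m, s in score.items()}

# 7 votes for model_a over model_b, 3 the other way
votes = [("model_a", "model_b")] * 7 + [("model_b", "model_a")] * 3
scores = fit_bradley_terry(votes)
```

Because the Bradley-Terry log-likelihood is concave, this simple batch update converges to the same ranking regardless of vote order, which is one reason it is preferred over online Elo for leaderboards.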
What about humor? Uh can we expect uh any type of uh benchmarking from LM Arena around uh just making the user laugh? That's a great idea. We should have a leaderboard for that. Absolutely.
I think that that's like a potential trap that humanity can fall into: LLMs get so good that you just refresh the button and laugh over and over and over, and then, you know, all productivity is lost and it's just the laugh button, you know.
Hey, if you're making people happy, you know, making people laugh, that's not a bad thing. Yeah. Sometimes I also scroll YouTube, like, looking at my favorite comedians, and I think they provide me value. Totally. Yeah, I have a post here I want your reaction to.
I said, "Now that AI can beat every intelligence test, we need new evals for the other human traits. Specifically, I'd like to quantify courage, fortitude, gallantry, mirth, equanimity, yearning, stout-heartedness, gravitas, panache. This would help me pick what LLM to use."
How close are we to being able to, uh, quantify courage in an LLM? I love that so much. We need your vision. Yeah. I don't know how close we are to courage. Maybe we need to up the stakes, have them compete in a real arena where they have some risk, some chips on the table, you know. Yeah. Yeah.
Our intern said, uh, you could probably do this using control vectors. And I said, uh, I don't want to put the courage mask over the shoggoth. I want to understand the base level of courageousness for the shoggoth as a whole. So, we went back and forth on that. We're getting biblical here.
I love it. Yeah. So where are you guys going to be? You obviously raised $100 million. Where are you going to be investing? What are some of the things coming down the pipeline that you're excited about?
Well, the majority of our effort these days is going towards taking the platform that we already have and making it a lot better for our community. Basically inviting more people in because as you can imagine, how does our platform work? Like people come to the platform.
At this point, millions, but we're hoping tens of millions or more of people will come to the platform and vote for the responses they like best. And then we take those votes and turn them into a number, right?
So what's the most important ingredient here? Getting people onto the platform and loving the platform, to play with all these different AIs and, like, ranking them and toying with them and so on and so forth. And we want to build great product features for them so that they continue to love it and keep coming back.
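The vote-to-number pipeline he keeps returning to can also run online: the classic Elo update, which Chatbot Arena originally used before moving to Bradley-Terry-style models, adjusts two ratings after every single battle. A minimal sketch (the model names, starting rating, and K-factor are illustrative assumptions, not LM Arena's actual parameters):

```python
def elo_update(ratings, winner, loser, k=32.0):
    """Apply one online Elo update after a single head-to-head vote.

    ratings: dict of model name -> rating (new models start at 1000).
    Mutates ratings in place and returns the dict.
    """
    ra = ratings.setdefault(winner, 1000.0)
    rb = ratings.setdefault(loser, 1000.0)
    # Expected win probability for the winner before this vote
    expected = 1.0 / (1.0 + 10.0 ** ((rb - ra) / 400.0))
    # Move both ratings by the surprise of the outcome
    ratings[winner] = ra + k * (1.0 - expected)
    ratings[loser] = rb - k * (1.0 - expected)
    return ratings

# Replay a small stream of (winner, loser) votes
ratings = {}
for w, l in [("gpt", "claude"), ("claude", "gemini"), ("gpt", "gemini")]:
    elo_update(ratings, w, l)
leaderboard = sorted(ratings, key=ratings.get, reverse=True)
```

Unlike the batch Bradley-Terry fit, online Elo is order-sensitive: replaying the same votes in a different sequence can yield slightly different ratings, which matters when votes arrive continuously from millions of users.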
So just investing in lmarena.ai, that's like where we're really targeting a lot of our energy. Uh, we have Dylan Patel coming on next. He recently dropped the AI mandate of heaven tier list. He's got OpenAI up at the top, uh, followed by Anthropic, DeepSeek, Google, xAI down at B tier.
Uh, Meta and Apple are struggling a little bit lower down in D tier and L tier. Uh, does that track with you? What have you been most excited about in the recent developments out of the foundation model labs? What do you think's, like, underrated right now amongst the various foundation models?
Well, I think, uh, in general the thing that excites me most is that there's a lot of competition these days and there's also a lot of new modalities that are coming out. So like, you know, all the time we're seeing that the models are improving, improving, improving, and the pace is not slowing down.
These things are getting way, way better, and a lot of the most brilliant people in the world are just here, like, in the crucible developing the best AI models. And then we're going beyond text, right? It's going from text to, people are now doing audio, and people are doing video. And, I don't know, there was a big video model release last week with Veo 3 that was huge. Um, and are you already doing, uh, tests against the different video models? Because right now, I mean, I've put the same prompt into Sora and Veo 3 and it wasn't even close at this point.
Obviously, the next iteration will be maybe more competitive, but it feels like we're kind of too early to do those types of evaluations, where people would just immediately identify a Veo 3 video.
But then there are some other labs out there like Runway that are doing cool stuff in video, and maybe it's harder to tell than people think. So, we have a sort of initial cut of a video arena that we're going to continue improving, that was built by some, uh, awesome graduate students at Berkeley. Yeah.
So, we have that, and we're going to continue building it. Um, but you know, I think one benefit of some of these strategies like arenas is that they can actually provide kind of a touchstone for the field to come together and say, "Hey, here's what we're going to try to improve. Let's go.
" And then maybe it would motivate some labs to actually continue to like build better and better video models because they want to, you know, they want to be winning the game. Mhm. Um what uh what has been your take on the the the llama journey? Uh there was some accusations of like benchmark hacking.
A lot of people have been saying, well, you know, like, never bet against Zuck. He has a capital cannon and a huge incentive to stay in the game. Uh, and so we could be seeing some exciting things out of, uh, Meta in the coming weeks or even coming months. I mean, all this stuff moves so fast.
The leaderboards are constantly shifting. What's your reaction been there? So, I would just say that, um, we should all be rooting for Meta, because they're releasing open models. Yeah. And so I think open models are really great for the ecosystem.
So I think I and everyone else should want them to succeed. Now, you know, there was a little bit of weirdness in the last release. I think it's okay. You know, they have a big group of really talented people. They're going to move on. And you know, the models they released, they're good. They're getting better.
Um, and I hope the trend continues for them. Yeah. How has this AI safety debate shifted recently? It feels like Meta almost could have had a little bit of an out, you know, in years past, when a model wasn't performing, uh, up to standard.
You could never really tell if the foundation labs were, uh, really worried about safety or they just didn't want to release the product yet. It was an easy kind of out. We didn't see them take that this time. We didn't see them say, "Hey, we're delaying because of, uh, safety concerns.
" Uh has that kind of melted away from the discourse in your community or do you think that there's a world where what you're building could be used to evaluate safety because there's so many different dimensions of safety now? I completely agree and we definitely want to make safety a priority in evaluations.
We have a red team arena now that, you know, has been out for a little while and it's still, you know, it's still a bit of a prototype, but we're going to keep working on it. Um, and I think it's really critical.
I don't know about like, you know, the initial question about, you know, whether or not that's used as an excuse or not. That's kind of I don't know that might be above my pay grade.
I don't hear a lot of those internal conversations, but I certainly think that like, uh, subjective evaluations are going to be important for safety. Jordy, anything else? No, this is great. I'm excited. Yeah. Thanks so much for coming on. Congratulations on the milestone. Excited.
Great to meet the two of you, and keep in touch. Yeah. Yeah. Come back on again soon when you have news. See you guys later. Bye. Cheers. Uh, really quickly while we have our... I feel like humanity has a duty, everybody should be required, it should be forced: you got to go on LM Arena daily, five minutes a day.
Every person on earth. What do you got, John? Uh, we got Bezel. getbezel.com. Your Bezel concierge is available now to source you any watch on the planet. Seriously, any watch. And our next guest is here, Dylan Patel from SemiAnalysis. We'll bring him in. Welcome to the show, Dylan. How you doing? Boom.
What's going on, dude? This live audience is fantastic.