Prime Intellect launches RL environment hub to make reinforcement learning accessible beyond frontier labs
Aug 27, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Will Brown
...hours and 31 minutes slept. I'm on a run. Got a 5-year warranty, 30-night risk-free trial, free returns, free shipping. Head over to eightsleep.com. I'll be right back and I will talk to Will. How you doing? Hey, how's it going? Great to be back. You're good. Jordy just plays a sound cue and steps away.
What's new in your world? What's the latest? Yeah, so about half an hour ago we did a big launch of something we've been building for quite a while at Prime Intellect. For those who have not met me, I'm Will Brown.
I work at Prime Intellect, an open source research company as well as a compute platform. And we released something today called the Environments Hub, which is for RL environments as well as evals. It's something we've been working towards for quite a while.
But I think it's also something that a lot of people are talking about while also wondering, what does this even mean? Like I saw over the weekend that Google Trends for RL environments shot up.
And you guys had made some joke tweets about this recently that I thought were great, like, "that's not an RL environment you were in, you placed real Amazon orders." It perfectly encapsulates my level of understanding here. So I'd love to go deeper.
It feels like a bull case for distributed infrastructure generally, because maybe it's additive, maybe it's replacing, but certainly less of the focus is on getting 100,000 H100s in one single cluster. There's more to be done in different places.
But try and concretize it for me a little more. What is an RL environment? How are they being used? What are the trade-offs in terms of compute and the design of the infrastructure that actually delivers some sort of valuable product at the end of the day? Sure. Yeah.
So I think we can get to the distributed aspect of it, but first: an environment is essentially an eval. A lot of the things people make as popular evals, like SWE-bench or ARC-AGI, these are environments. They're a thing where you have some input tasks.
You have some kind of harness, and then you have a thing at the end that looks at what your model or agent in the harness does and gives a score at the end. And so this is the exact setup we use for evals, but it's also what you use for RL training.
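The setup Will describes — input tasks, a harness that runs the agent, and a scorer at the end — can be sketched in a few lines. All names here are illustrative, not a real library's API:

```python
# Minimal sketch of the "environment = eval" pattern:
# input tasks -> harness runs the agent -> scorer returns a number.
from dataclasses import dataclass
from typing import Callable, List

@dataclass
class Task:
    prompt: str
    gold: str  # reference answer used by the scorer

def run_environment(tasks: List[Task],
                    agent: Callable[[str], str],
                    score: Callable[[str, str], float]) -> float:
    """Run the agent on each task and average the scores (0.0-1.0)."""
    results = [score(agent(t.prompt), t.gold) for t in tasks]
    return sum(results) / len(results)

# Used as an eval, you report the average score; used for RL,
# the same per-rollout scores become the training reward signal.
tasks = [Task("2+2=", "4"), Task("3*3=", "9")]
exact_match = lambda out, gold: float(out.strip() == gold)
print(run_environment(tasks, lambda p: "4", exact_match))  # 0.5: right on one task
```

The only difference between "eval" and "RL environment" in this sketch is what you do with the scores afterward: log them, or feed them back as rewards.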
And so when people talk about making a bunch of RL environments, what they really mean is they're making evals designed for agents to do some task interactively. Yeah. Is the Amazon order case like that report we saw, that some folks are building full digital replicas?
I mean, I guess it's already digital, but a simulated environment of Amazon.com or DoorDash. Are those kind of the power law opportunities that people are really focused on? What are some other examples?
Are people building these environments for old enterprise software systems that people want agents to interact with? How wide is the scale of RL environments right now? Yeah, it's pretty huge, actually.
So I think there are these kind of flagship-website sorts of things, but there's also a long tail of increasingly niche applications, going down towards very fine-grained, narrow things and all the way up towards broad domains where models really struggle because we don't have a good interface for them.
Excel is one: everyone uses Excel, but no one is really an LLM power user in Excel, because it doesn't integrate very well. We don't have the Cursor for Excel yet as a broadly deployed thing.
Yeah, I think it's easier to have an environment for a terminal or for a coding agent because the harness is a little simpler. And so that's one sort of thing you'd want as an environment, and that's one of the sorts of things that the Mechanizes of the world, I imagine, are building.
I'm sure the labs all have their ideas of what tasks their customers want models to be better at that they're currently not being used for, as well as what sorts of products people are trying to build.
So you can build the greatest agent harness in the world, but if your model hasn't been trained for it, some models might just not be very good, because it's not ergonomic for them.
And so you can think of environments as a way of taking a model, an interface that you really want your model to be great in (a harness), and a feedback loop that lets it get better at being a model or agent in that harness. Yeah. Help me understand the Excel case more specifically.
Copilot exists there. We're hoping Microsoft rebrands it as Clippy eventually. But there are also like seven different startups building Cursor for Excel.
Help me understand the context of verifiable rewards and what that would look like in Excel. If I'm trying to do an analysis, understand a company's financials, build a DCF: there are templates, but to actually get the correct valuation, even if I have the perfect DCF and it tells me Nvidia is worth five trillion, I can't verify that against the market.
I could wait a year, but the DCF could still be right and the market could be wrong. So what does a verifiable reward look like? What are people actually building towards in that environment, do you think? Right. Yeah.
So the easiest recipe, which we've done work on, and I've seen a lot of papers about this, and it's also kind of the obvious, only real thing you can do, is to have a correct answer. There is a gold standard: someone made this DCF, the agent doesn't see the finished one, and it's new enough that it's out of the training distribution from pre-training; the past six months is perfect. And then you want to train your model to be in the harness where it could make the DCF, and then look at what it produced.
And then you have all these different things you can check: did it get all these fields right? Yeah.
And those evaluations are a combination of vibe checks and spot checks, with humans looking at it. This is kind of how people have done it historically: you build a thing, you try it out, you look at it, you try to make a list of what's wrong. And that doesn't really scale.
To scale it, you need to automate this evaluation step. You want a thing that comes out, a DCF, a spreadsheet, a paper, or a piece of code, and a way of programmatically checking it. That checking usually involves LLMs in the loop, maybe fine-tuned or customized LLMs for grading whether the thing has been done correctly. And then you've got to iterate on this meta process a bit to have a good grader, and see whether the result of your grader matches your human vibe check, your spot check, based on having some kind of experts in the loop.
But what you then want to do is take that kind of evaluation process and freeze it as a piece of code, and now you can plug it into your harness.
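A frozen evaluation like this might look like the sketch below: hard programmatic checks on the fields, with a placeholder for the LLM-rubric grader Will mentions. The function and field names are made up for illustration; `llm_grade` is a stand-in for whatever model call you'd actually wire in:

```python
# Sketch of an evaluation frozen as code: an agent-produced DCF is graded
# against a gold-standard one with tolerance checks on numeric fields.
def check_spreadsheet(produced: dict, gold: dict) -> float:
    """Fraction of gold-standard fields the agent got within 5%."""
    hard = [abs(produced.get(k, float("inf")) - v) / abs(v) < 0.05
            for k, v in gold["fields"].items()]
    # Fuzzy criteria (narrative quality, assumptions) would go through an
    # LLM rubric grader here, e.g.:
    #   soft = llm_grade(produced["narrative"], rubric=gold["rubric"])
    return sum(hard) / len(hard)

gold = {"fields": {"wacc": 0.09, "terminal_growth": 0.025}}
print(check_spreadsheet({"wacc": 0.091, "terminal_growth": 0.025}, gold))  # 1.0
```

Once this grader matches your human spot checks well enough, it becomes the reward function: the same code scores eval runs and RL rollouts.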
So the harness, plus some input tasks like "do this for every company," and your gold standards for every company of what a report should look like, made by some analyst. You're grading against the golden answer, and now you scale this up. What was the industry standard before you rolled this out?
Would the actual AI researchers or AI engineers who were going to do reinforcement learning on a model in a particular domain design and implement the environment, with the reward, all internally, so it didn't exist as an external function whatsoever?
Was that the status quo in terms of open source infrastructure? In terms of methods broadly, very few people have been doing this. The number of people who have successfully trained a large-scale agent model with reinforcement learning is very small. Yeah, I imagine it is not broadly deployed. The companies on the periphery of maybe doing it are like Cursor, Perplexity. Yeah, so it's not a common thing. When I hear "we're an AI agent startup," what that really means is maybe you're leveraging another agent or some sort of agentic API, and then maybe you might train your own. Is there a significant cost to this? If I just take the landscape of everyone building Cursor for X right now, that implies that at some point, if we see this bifurcation and the market for Cursor for X really is highly fragmented, and there's not just one foundation model company that eats everything...
...are you going to see small reinforcement learning environments with verifiable rewards in every single subniche of SaaS that these companies are going after, and what would the costs be? Is it expensive? So, I think you have to build the harness anyway.
I think one of the hardest things to scale is building good evaluations and rubric criteria. Yeah. If you can do the evaluation bit, then it depends. One thing is the infrastructure problem: doing this for a frontier-scale model is not broadly accessible.
It's not a thing where you can just put in money and it comes out. That's what we're working on building. But you can do it with these tiny models, the non-mixture-of-experts ones.
So the kind of technical thing is, to serve models at scale you really want them to be large mixture-of-experts models that are efficient for inference. The current ecosystem of tooling for doing that yourself is not great.
Most people are doing Llama and Qwen experiments, and these models are good models, kind of, but they're not really what you want to deploy for a serious application. Okay.
So, can you give me a rough order of magnitude for the level of cost? Say I have a problem in a specific domain where I think I need to train an agent in an RL environment. I have the verifiable reward,
my AI researcher has defined the harness and the reward, and I go to you to actually do the RL on top of maybe an open-source model. Am I talking about tens of millions of dollars of GPU cost? Try and ground it for me a little bit more. I would say, and this is napkin math:
Yeah. If you want to do a serious run that would meaningfully improve a model at the DeepSeek or Kimi scale, we're talking hundreds of thousands for a big, serious one. But you can do a lot for thousands, actually. In terms of raw compute. Yeah.
When people were saying, "I fine-tuned Llama on this thing and now it does this funny thing," or "I fine-tuned this image model on my face," those used to be kind of prosumer-level projects.
Now we're getting into, okay, this is an enterprise-level effort, but you don't need to call up SoftBank to get it done. Right. I mean, there's a whole spectrum of these things; that's why it's hard to say what the bucket is. You can fine-tune Llama on your laptop, but you can't fine-tune Kimi or DeepSeek, the big ones, on your laptop. And it also depends: how long-running is the task, how many samples do you need, how much do you really want to crank it? It's one of those things where you can kind of just get more as you dump in more compute.
Yep. But currently people are spending a very large fraction of their total compute costs on experimentation and rebuilding the same stuff. It's not easy to do this: getting your GPUs to work correctly, getting your libraries to work correctly, deciding what hyperparameters to use.
This is one of the reasons the labs have needed so much compute and why the expenses are so high: they have all these researchers, the researchers are all doing experiments, and each of these experiments is thousands of dollars of compute.
And you multiply this over a year and 100 researchers, and that's a lot. Okay, sorry, you can finish. Oh sure. Yeah. Just that I think with some of these pieces, once you can get the hard questions answered once,
you don't need to keep redoing them for every new environment, and you can have some very cheap spot checks. A kind of one-shot eval of how good the model is at the start can be a bellwether for whether the run is going to work. Okay.
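The cheap spot check described here can be made concrete: score the base model once on the environment before committing to a full run. The thresholds below are illustrative assumptions, not anything stated in the conversation:

```python
# Sketch of a pre-flight bellwether check: a one-shot eval of the base
# model decides whether a full RL run is worth launching.
def worth_training(baseline_score: float,
                   low: float = 0.05, high: float = 0.95) -> bool:
    """RL needs reward variance: the model must sometimes succeed and
    sometimes fail. Near-0 or near-1 baselines give almost no signal."""
    return low < baseline_score < high

print(worth_training(0.30))  # True: partial success, there is signal to reinforce
print(worth_training(0.00))  # False: model never succeeds, nothing to reinforce
```

The intuition is the one Will gives: if the model can't do the task at all (or already aces it), the expensive run is unlikely to move the needle.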
Loosely related to the concept of generalizability, I want your reaction to this roon post. roon said, "My bar for AGI is an AI that can learn to run a gas station for a year without a team of scientists collecting the gas station dataset." What do you think about that as a bar for AGI?
We don't need these RL environments for specific tasks. What are your thoughts? I mean, one way that could look is, I think that could be answered by an agent that realizes it's not good at gas station stuff and figures out how to create its own training environment.
It's like, okay, I need to practice. The same way that like if you want to learn how to code, you got to go find stuff to practice on.
Maybe that's one way of thinking about what that level of AGI means of like truly self-learning, like build your own tasks and curate your own tasks in a way that allows you to check whether you've been doing them correctly or not. Yeah. Yeah.
It does feel like we're in this era where there are a few really obvious things that we want to RL on and get really good at. And you see that with the IMO and the math and the deep research reports and they're very reliable. But then once you come up with some random task, then you need the gas station data set.
And so yeah, maybe the future is: go and find those autonomously, set up the reward function, train, then iterate and bake that in. But sorry, go ahead.
Oh yeah, I'd just say I think one way to do this is that the environment can be the thing you're going to serve. Cursor is an environment. Lovable is an environment.
All these things — people are already building things that could be environments if they connect the wiring correctly — but I think we're not quite there yet in terms of this being a thing people are scaling, where every company wants to hire environment builders and RL people.
Yeah. And they're not all going to hire people for this. We need better ways of scaling this out to more people, like infrastructure as a service. Yeah, that makes a ton of sense.
I wanted to ask about Dario's comments in his interview with John Collison. Basically, I'll read the quote. He says there are two different ways you could describe what's happening in the model business right now.
So let's say in 2023 you train a model that costs 100 million and then you deploy it in 2024 and it makes 200 million of revenue.
Meanwhile, because of the scaling laws, in 2024 you also train a model that costs 1 billion, and then in 2025 you get 2 billion of revenue from that 1 billion model, while you've spent 10 billion to train the next model.
So if you look in a conventional way at the profit and loss of the company, you've lost 100 million in the first year, 800 million in the second, and 8 billion in the third year. So it looks like it's getting worse and worse. If you consider each model to be a company, the model that was trained in 2023 was profitable.
You paid 100 million, then it made 200 million of revenue. There's some inference cost with the model, but let's set that aside in this cartoonish example. And so fintwit has just been freaking out about this, being incredibly bearish: it's obviously a bubble.
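The per-year numbers in Dario's cartoon check out if each model's revenue lands the year after its training spend (inference costs ignored, as the quote says):

```python
# Dario's cartoon P&L: training spend per calendar year vs. revenue from
# the previous year's model, in millions of dollars.
train   = {2023: 100, 2024: 1_000, 2025: 10_000}
revenue = {2023:   0, 2024:   200, 2025:  2_000}  # from prior year's model
pnl = {year: revenue[year] - train[year] for year in train}
print(pnl)  # {2023: -100, 2024: -800, 2025: -8000}: worse every year
# Viewed model-by-model instead, the 2023 model alone returned 200 on 100.
```

Same data, two framings: aggregated by calendar year the losses compound, while each model, treated as its own company, doubles its money.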
And worried that you have exponential increases in cost and then unpredictable capital needs in the future.
Why do you think they should stop freaking out, given that you were just over at Morgan Stanley? Yeah, so there's an element of the freaking out that's kind of a good point, which is that you can't keep doubling every year. Yeah. At some point there's a plateau.
You can still grow every year, you can still have exponential growth, but the percentages won't hold; there are only so many people to adopt AI products. And once they adopt it, if they're going to be spending more and more, it has to be delivering a very large multiplier of new value per year.
And there are some laws of physics about how good models are going to get, how fast we can possibly make the chips. The chips are only going to scale so quickly.
We can only build so many GPUs year after year at the current rate of growth. So I do think there will be the space we end up in where stuff broadly works quite well, there are a few big labs who have very good models you can deploy, and then there's also this kind of Pareto curve: how much can you spend, how fast does it need to be, how hands-on do you want to be, how narrow is the thing you want to do.
So if your thing is like oh I want to build a great browser agent, this is super broad. There's so many things you can do on a browser. If your thing is I want to build an agent for writing Rust, this is much more narrow.
And so if you're making the product just for Rust developers, like the Cursor example, you can most likely make a model that's just as good as Claude Opus 4 but much smaller and much cheaper, because the focus is more narrow.
And so this is kind of how I think about generalizability is like you can spend more on compute, you can make the model bigger. These are all different axes to spend and you can also make it like more narrow and go deeper on one thing. And so we're going to have a lot of knobs to turn I think. Yeah.
Are you broadly long RL environments, short normal pre-training data? Do you have any extra color on the relative value, or where the low-hanging fruit is across those two? I'm long both, but I think they're going to kind of turn into the same thing.
For example, one thing people have been doing for pre-training recently is using DeepSeek-R1 to generate tons of reasoning samples, then mixing those in at the end of pre-training and calling it mid-training. In some sense this is RL, because it's just taking the juice from doing RL on DeepSeek-V3, but it's also pre-training.
And so one thing we're excited about with environments and scaling these is that you can get a lot of good data out of them: you have a task set, you have the ability to generate agent data inside of this task set, and you have a filter for throwing out the bad stuff and keeping the good stuff.
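The generate-filter-keep loop being described is essentially rejection sampling against the environment's own grader. A minimal sketch, with stand-in names for the agent and scorer:

```python
# Sketch of harvesting pretraining data from an environment:
# roll out agents on tasks, keep only transcripts the grader scores highly.
from typing import Callable, List

def harvest(tasks: List[str],
            agent: Callable[[str], str],
            score: Callable[[str, str], float],
            threshold: float = 0.9) -> List[str]:
    """Rejection sampling: keep transcripts that pass the grader."""
    kept = []
    for task in tasks:
        transcript = agent(task)          # full agent trajectory as text
        if score(transcript, task) >= threshold:
            kept.append(transcript)       # goes into the mid-training mix
    return kept

tasks = ["2+2", "2+3"]
agent = lambda t: f"{t} = {eval(t)}"      # toy stand-in "agent"
score = lambda out, task: 1.0 if str(eval(task)) in out else 0.0
print(harvest(tasks, agent, score))  # ['2+2 = 4', '2+3 = 5']
```

Run at scale, the surviving transcripts become the "trillions of tokens" of filtered agent data mentioned next.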
And so can you just do trillions of tokens of this and put it into pre-training? Why not? Do you have insight into what's happening in the video or image world equivalent of RL environments, verifiable rewards, etc.? I was just shocked by the level of quality of text in images in ChatGPT.
We're seeing the same thing with Nano Banana from Gemini. And it felt like images in ChatGPT particularly were uniquely good at Studio Ghibli and uniquely good at text.
I was almost putting on my tin foil hat and saying they're using Photoshop here, doing two different layers or something, or there's something else going on beyond just one really good model. It feels spiky.
It feels particularly good at text and particularly good at cartoons, but it hadn't gotten way better at super photoreal imagery or whatever.
So do you have any view on how you translate all the stuff that's happening in the agents and text-based LLM world into the diffusion and image or video world? Sure. Yeah.
So I am not a diffusion expert or anything, but I imagine it's a mix of essentially these kinds of environments, where you have some grader, as well as good old-fashioned RLHF, the stuff people were doing in the InstructGPT era, where you have upvote/downvote and you're training a reward model to check these. And I think the image domain is easier for spotting errors, whether it's a human or an automatic grader: if text is wrong, you can do OCR on the image, get the text back out, and see if it's the right image for the prompt.
And so this is pretty easy to verify, versus: if a ChatGPT answer is slop, how do you verify that it's slop? What's the algorithm that checks whether something is slop or not? That's pretty hard. Yeah.
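The OCR check described here makes a cheap verifiable reward for image models. A sketch of just the scoring half (the image-generation and OCR calls, for which something like pytesseract is one real option, are left out; the function name is illustrative):

```python
# Sketch of an OCR-based verifiable reward for text rendering in images:
# compare the text recovered from the generated image against the text
# the prompt asked for.
def text_render_reward(requested_text: str, ocr_text: str) -> float:
    """1.0 if every requested word survives rendering + OCR,
    else partial credit by word overlap."""
    want = requested_text.lower().split()
    got = set(ocr_text.lower().split())
    return sum(w in got for w in want) / len(want)

# e.g. the prompt asked for a sign reading "OPEN 24 HOURS",
# and OCR read back a typo'd rendering:
print(text_render_reward("OPEN 24 HOURS", "open 24 huors"))  # ≈0.67
```

This is exactly the "easy to measure" case: the check is fully automatic, unlike deciding whether a chat answer is slop.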
And so in some sense the image models were, for a while, just bad at things we could very easily measure, while it's gotten much harder, much faster, to measure the things LLMs are bad at.
And so I think some of this is the old-fashioned tricks that made the original ChatGPT as good as it was being applied to the image domain, patching up the obvious fixes. Give me your hot take.
How much do you think Meta is paying Midjourney with that new deal they announced? It's a good question. Do you think it's nine figures a year? Hopefully. I mean, it feels like it would be more, maybe. I don't know. They're spending crazy amounts on everything.
That's what I was saying. And it makes sense for Midjourney to want a low-key announcement about it, but there's another team that would have said, "We just signed a $500 million a year, 10-year deal," or something obscene that looks more like an acquisition,
even though, from a dollar-value standpoint, they're clearly going to continue to operate independently. Yeah, I would kind of think of it as Midjourney's version of doing an API business.
So it could just be based on usage. I'm sure there's a big-picture component to it, but you can kind of do the napkin math by asking what other image providers are charging for API usage.
Like fal or Replicate or all these other services. You can map out the cost of an image, or back it out from Midjourney subscription prices and how much they give you, and then ask how many of these people are going to be doing. If they're doing generative ads for Instagram, that's a lot. If every person has the ability to apply it to their Instagram posts or their stories, that's a lot. So the numbers can get crazy pretty quickly. And I think it does depend: if it's just ads, or just certain customers, or it's not deeply integrated, then the scale is not as crazy as if you're really getting Midjourney stuff everywhere, for everyone, instantly, at volume.
Yeah, I was thinking when the Studio Ghibli moment happened that the response from Meta should have been to pre-generate a Studio Ghibli version of every single person on Instagram's profile photo, because that's basically what people were doing in ChatGPT.
Pre-generate that, and then when you open Instagram it just says, "Hey, we did this for you. Do you want to share?"
And that would probably be the biggest day of usage on Instagram, but it would also probably bankrupt the company, because it'd probably be like $50 billion worth of inference or something. I don't know. Those images are expensive. Yeah. Got some breaking news.
Nvidia beat on revenue and earnings per share, but is down 5%. Wait, why? They didn't beat hard enough. They didn't beat hard enough. Can we ring the gong for the beat? I mean, let's ring it. Yeah, four and one respectively.
Nvidia's both a competitor and a supplier-partner to you, right? Yeah. I mean, we resell GPUs, so we're big Nvidia fans. So you're rooting for them. That's great. Yeah. And I think Nvidia has been very friendly to the ecosystem of players.
Nvidia is not trying to be the single player; you can't even buy GPUs on their website. They don't sell GPUs to individuals; they sell them to data centers and big companies. Yeah.
And I know there was that news that they were coming for the neoclouds with DGX Lepton, and it felt like they were dipping their toe in that area, but it didn't feel like this was existential for anyone else in the ecosystem. It felt somewhat additive.
I don't know if you have a take, but... Oh, yeah. So I think it's also different: they're going to offer that as a very premium, white-glove sort of thing to certain enterprises. Sure.
I think we are much more on the end of getting all the data centers we can find, partnering with every neocloud, having really cheap pricing, and then building features on top. Yeah. Where part of this is doing core research.
So we have friends at Nvidia and we talk to them; they release a lot of cool open source research stuff, and that is very much in the spirit of the sorts of things we're aiming to do, and I think we're friendly. Yeah.
Are there any RL tasks that you think are truly intractable? Something fundamental to humans, creativity or comedy or something? What's the Mount Everest of developing an RL environment?
I think the hardest is stuff that — and a friend of mine was tweeting about this earlier today, one of our collaborators from the evals world who's doing some cool open source eval work for computer use — things that need a human in the loop to have a fine-grained, accurate simulation. If models are not good enough at replicating human behavior yet, and that human behavior is key to the environment, then you're not going to have an environment that faithfully captures the task, because there's a kind of chicken-and-egg problem. Twitch streamer was one example: having a model that can be good in Twitch chat kind of requires an accurate sim of Twitch chat.
And how do you build that? Or a live streamer. There's a real-time interaction problem you need to solve before you can get started, because of the scale at which these things move. You can do it if an LLM is simulating the user. Yeah. You can't do it if you need a real human.
Yeah. I feel like the time horizon is really tricky to simulate, when you think about the impact on a life of a certain behavior. I mean, we struggle with this in drug development: understanding whether something in childhood affects you as a retired person.
Okay, now you have to create a simulation of the entire human body to understand that if you did this thing at 18, it's going to cause your knee to blow out when you're 65 or something. That feels much harder to simulate one-shot. But maybe we'll get there.
Who knows? Yeah, at some point you've just got to let the arrow of time run forward and see what happens. Exactly. Well, thank you for hopping on. Always great to see you. Congratulations on the launch. Check out the Prime Environments Hub; we are live now, or on X. See you around. Send it, chat. See you. See you guys.
Talk to you later. Bye. And yes, we have the Nvidia market chart here, down 2.75%. Is that what I'm seeing here? Correct. Let's see how Nvidia is doing as well. High expectations, but after hours it's down 1.42%. It's bouncing back right now.
The company's trading at $179 a share, was at $181 a share. Something tells me Jensen's going to shrug this off. If you look at the