Prime Intellect launches 'Vibe RL' platform to make reinforcement learning accessible to any developer
Feb 23, 2026 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Will Brown
How's it going?
Welcome to the show, Will. How are you doing? It's been too long.
I'm doing great. I'm doing great. It's great to be back. I think this is the fourth one or something like that. There was a list at some point of the record, and some people have been on quite a few more.
I'm looking I'm looking forward to the 400th.
It's great. It's been a lot. What are your old buddies at Morgan Stanley thinking about the current thing in tech, the 2028 intelligence crisis? Have you gotten any messages?
That's a great question. I have not done the full deep dive.
They're too busy. They're too busy hitting the sell button.
No, everyone at Morgan Stanley is too busy setting up Mac minis to run OpenClaw. That's what's happening, because they all just read that something big is happening,
right? And so I think there are definitely a lot of opinions on all sides. To me, the piece was pretty cool. I don't necessarily agree with it, but I think it was effective at getting people to have more interesting conversations than, for example, other recent viral pieces about how everything's going crazy. And the conversation ended up getting into the weeds of monetary policy, how people are going to react, how hard it is to vibe code a DoorDash clone, and I think these sorts of things are actually
the sorts of conversations that are good for more people to be having, whether or not a certain prediction is right.
As stuff gets crazier, I feel like this is the sort of thing that allows the rest of the world to hear, from their friends, a little more grounded discussion about what could happen.
Yeah. Well, give us an update from Prime Intellect. What's going on in your world?
Yeah. So, there are a few things I think are interesting, as well as some stuff I want to talk about today given what's happening on the timeline. A couple weeks ago, we released a training platform to make it really easy for people to do RL on top of leading open source models with their own environments. We've tried to make it an agent-native experience. A lot of people have been tweeting out their experiences, and the term people have been using is "vibe RL." We're now at the point where the infra to manage the training is in place and you can do it without thinking about the hardware and the GPUs, which is where the models still kind of struggle. You can really focus on the environment: designing your tasks, specifying what you want, and turning existing data that you already have into training recipes. So we're at the point where this is pretty accessible for people to go train models, and it's been pretty cool seeing people have fun with it.
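The workflow Will describes, focusing on the environment and the reward rather than the GPUs, can be sketched roughly like this. This is an illustrative Python sketch, not Prime Intellect's actual SDK; every class and function name here is made up.

```python
# Hypothetical sketch of a "vibe RL" environment definition: the user supplies
# tasks and a reward function; the platform would handle the model, rollouts,
# and hardware. None of these names are the real Prime Intellect API.

from dataclasses import dataclass
from typing import Callable

@dataclass
class Task:
    prompt: str
    target: str  # ground-truth answer used for scoring

def exact_match_reward(completion: str, task: Task) -> float:
    """Reward 1.0 if the model's answer matches the target, else 0.0."""
    return 1.0 if completion.strip() == task.target.strip() else 0.0

@dataclass
class Environment:
    tasks: list[Task]
    reward_fn: Callable[[str, Task], float]

    def score_rollouts(self, completions: list[str]) -> list[float]:
        # In a real run the completions come from the policy model during RL;
        # here we just score whatever strings we are given.
        return [self.reward_fn(c, t) for c, t in zip(completions, self.tasks)]

env = Environment(
    tasks=[Task(prompt="What is 2+2?", target="4")],
    reward_fn=exact_match_reward,
)
print(env.score_rollouts(["4"]))  # [1.0]
```

The point of the "vibe RL" framing is that the user only writes the equivalents of the `Task` list and `reward_fn`; the training loop and the GPUs sit behind the platform.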
Yeah. Concretize some of the actual applications. I imagine this works best if everything flows through text, flows through CLI tools. Because when I think, okay, great, I'm going to set up an RL environment and automate one of my workflows, I'm like, well, I'll need to open Adobe Premiere, which has a license, and then I'll need to go to YouTube and download some videos. I'm just thinking about editing a short video that we have, right?
Of course. Yes. Some of that stuff, definitely; there's a range from simple to complicated. But I think there are a lot of sweet spots: coding, tool use, and interacting with app simulators, which are the focus areas for a lot of the people doing training in the labs anyway, rather than full-fledged Photoshop. There's a great tweet actually from the Ramp guys, from Ramp Labs, showing off how they had been using it for some stuff. It was the team behind Ramp Sheets. So you can imagine the sorts of things there, where you don't actually have to build the whole thing, but you can,
and this is where I think the coding agent stuff is really useful: there are a lot of ways you can have the right simulation of the application. It isn't necessarily the full back end, but it's enough to capture the task. Yeah.
And the coding agents are good enough, with a human in the loop. It's all CLI-native. So you go into your terminal, and we have our CLI, the prime CLI, and you use that to
set up your skills and get your coding agent configured with your AGENTS.md. We've tried to make that really smooth, but then you're just talking to your agent: hey, my data is here, my app code's over here, let's put it all in the right place and kick off some runs.
Yeah. Where are we on the path to personalized RL? I'm thinking back to the RL that went into RLHF around ChatGPT and GPT-4, and I remember they had maybe tens of thousands of contractors grading responses, giving thumbs up or thumbs down, giving feedback at varying levels: a really large-scale, generalized process.
Yeah. If I'm running a medium-sized company, is this something I can pull from logs of what's happening in the business? Should I be firing up a data labeling company to help me generate more data? Because in the long term, I would love just a screen recorder: watch what I do, and then it's RLing and getting better, and all of a sudden it can do what I do from a single prompt.
Yeah, it's pretty close. It's definitely not that sci-fi. Especially if we focus on
text or image input; full screen recording gets a little tricky. But if it's text or image input that comes from agent logs, where a human inputs text and images to an agent, and you have these logs and you're trying to synthesize from them, I think the trickiest part is refining criteria for what counts as good, for rescoring another try. But a lot of times the criteria are either pretty general across tasks, or you can infer a lot of them from a user's response or just from the initial prompt. And from what we've seen, that's especially true for a lot of very concrete problem-solving use cases, more so than, say, creative writing. Yeah,
but for things where there's a right answer, and it's not too hard to see whether the model got the right answer from an agent trace, either from the human's response or, say, if companies have their humans label as part of using the product by default, yeah, this is doable. The RL recipes are stable and scalable enough; it doesn't always work, but it works reliably enough, and the barrier to entry and the cost are at a point where it's cool to see that this is now a thing people can go do. Yeah.
And we're seeing a lot of people have success with it.
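The "logs into training recipes" idea above can be sketched as a filtering pass: keep only traces where some verifiable signal exists, and attach a reward. The log schema and field names below are hypothetical, purely for illustration.

```python
# Sketch: turning existing agent logs into RL-style training examples.
# The log format ("final_answer", "expected", etc.) is made up.

def logs_to_examples(logs: list[dict]) -> list[dict]:
    """Keep only log entries with a verifiable signal, attaching a reward."""
    examples = []
    for entry in logs:
        answer = entry.get("final_answer")
        expected = entry.get("expected")  # e.g. from a later human correction
        if answer is None or expected is None:
            continue  # no way to score this trace, so drop it
        examples.append({
            "prompt": entry["prompt"],
            "completion": answer,
            "reward": 1.0 if answer == expected else 0.0,
        })
    return examples

logs = [
    {"prompt": "Refund order 123", "final_answer": "refunded", "expected": "refunded"},
    {"prompt": "Summarize this ticket", "final_answer": "done"},  # unverifiable
]
print(logs_to_examples(logs))  # one scored example; the second entry is dropped
```

This matches the point above: the hard part isn't the mechanics, it's deciding which traces carry a signal you trust enough to score.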
Yeah. How are you thinking about the debate between MCP and CLI? Peter was going back and forth on it, and it was something I was wondering even when MCP came out. It seemed really cool, but at the same time it felt like, well, I remember going to the front end, hitting inspect element to see what's coming across, and, oh, there's an HTTP request right there; let's reverse engineer that. Yeah, it's all kind of the same thing. It's sending requests. And so
I think people realized models were good enough at coding that
the skills approach is essentially doing the same thing, but you just have more flexibility.
I think the area where MCP makes the most sense is when you really want fine-grained auth stuff going on, where there are
credentials and you want to be able to notice what's being done and have the user approve certain requests and not others. That's where the formalism of the tool call is really useful, as opposed to it just being code that has an API token.
But from the perspective of capabilities, skills are nice, and MCP has its areas where it makes sense, but it's really just models using computers.
Whether it's MCP or code or skill files and reading docs, models are pretty good at reading stuff, and if something has instructions on how to do a thing, they can kind of just do the thing, and they can do it for long enough that it's useful.
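The auth point above, that a formal tool-call layer lets you gate individual actions on user approval in a way free-form code holding a raw API token cannot, can be sketched like this. The names are illustrative, not MCP's actual API.

```python
# Sketch of a tool-call choke point: every sensitive action flows through one
# function where it can be audited and approved or denied. Tool names are
# made up for illustration.

SENSITIVE = {"send_payment", "delete_repo"}

def call_tool(name: str, args: dict, approve) -> str:
    """Route a tool call, asking the user before anything sensitive runs.

    `approve` stands in for a real approval UI prompt; the formalism of the
    tool call is what makes this interception possible at all.
    """
    if name in SENSITIVE and not approve(name, args):
        return "denied"
    return f"ran {name}"

# Auto-deny everything sensitive for this demo.
print(call_tool("send_payment", {"amount": 10}, approve=lambda n, a: False))  # denied
print(call_tool("list_files", {}, approve=lambda n, a: False))  # ran list_files
```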
Yeah.
Um
How intermediated do you think this product will be? What I mean is, let's use a toy example: a widgets company would benefit from a custom fine-tuned or RL'd model, but they don't come to you. There's actually an intermediary company providing a SaaS product that's fine-tuned on anonymized industry data, or that generated the RL data, or they'll even come to the company and say, "Hey, we'll handle the prime CLI. You're not going to have to know what that is. You just give us the data, and we'll act as your customer." How do you think that plays out in the market?
I mean, it's definitely going to happen across the spectrum. I think the people we work with most directly are the ones who are a little more AI-native, because we're really building for developers as our target audience, not necessarily researchers. So the people who are
following the benchmarks, reading about the new model releases, and building with Claude Code and the agent frameworks, that's really our target audience: people who think about evals and prompting, versus people who don't think about that. We do also work with a lot of the big data companies. Maybe the one interesting story is that there's a lot of market demand because everyone's building environments and selling them to the labs. Yep.
But you can see a lot of these companies want to know that their environments are good, and so using RL as part of this process
is a way to evaluate the quality and be able to prove, hey, we've got the good stuff,
because it actually improves capabilities. So there's this whole economy of companies that specialize in building environments and working with data, and I imagine that becomes a big part of the way this stuff is consumed by end companies: through people with that kind of expertise at the data level.
Jordan, talk about the Chinese labs.
I was going to ask the exact same thing:
where are they in terms of the American models?
Talk about kind of the scale.
I've seen some rave reviews. I've genuinely seen some rave reviews of Kimi K2, and at the same time I've also seen it kind of fall flat on its face when I pushed it beyond a toy example. So yeah, what's real, and what are you experiencing?
Yeah, so I think they're definitely a couple months behind; they're not at the 4.6 or the Codex 5.3 level. But they're pretty close to what we had before that, and it feels like the gap is tightening. Where I get most excited is that they're good enough that going the extra mile with customization is a differentiator: you can take a model that's already almost frontier
and make it the best model in the world at your thing, pretty easily and pretty quickly. So even if you have to do this every three months, it's always a capabilities race, but if this process of taking your data and improving the latest model becomes really easy and repeatable, then you can get a lot of value out of doing that, and I think that's the sort of thing that's going to be in a lot of people's toolkits. In terms of the open source models generally, there was some interesting debate on the timeline today that I dove into for a bit, around Anthropic and DeepSeek and distillation. It feels like there are a few things. There's the geopolitical element. There's the terms-of-service element: they're running bot farms, they're scraping, that's not allowed. And then there's the idea of distillation more broadly. I totally get the first two, but the thing I was trying to push back on a bit was: everything on GitHub is someone typing a prompt to Claude
and submitting it to Claude Code, and then they review the PR and merge it, and this is perfect training data,
and so the internet is just getting flooded with perfect Claude distillation training data.
Interesting. Yeah.
And there's not much you can do about that. So is distillation really the hill we want to die on?
Okay. Yeah. I guess the secondary question is: put all of that aside and just ask whether there's some ticking time bomb in using a distilled model, where you run into some wall or some performance problem down the road. So yeah, you're doing well on benchmarks, but then it's just less effective. Is that actually problematic from a business perspective? Or is it just, okay, I'm three months behind, but it's three times cheaper, so I'm fine with that tradeoff, versus, I thought I was using something great and then it blew up on me?
Right. So, it depends a lot on your application. I think there are certain things where the models are already more than good enough, and these are the more commodity extraction or summarization or labeling use cases, where you just want to optimize for cost, or in some cases for speed. If cost isn't a concern and you really just care about top-line performance, then customization is where the open source models become interesting: you can do more to an open source model than you can do to Claude, and you have much more fine-grained control. You can say, hey, this is my eval, this is how I'm measuring performance, we're just going to hill-climb this. Then it's up to you as a business to define your business logic: this is what I actually care about, this is what performance means. I think we'll see a lot of companies realizing that's a useful knob to turn. Concretely, what it'll look like in a lot of cases is multi-agent products that have a main orchestrator agent, one of the frontier models, with lots of specialized sub-agents for things related to the business and specific workflows, which are fine-tuned models. That's kind of what we see currently as the dominant paradigm for mixing and matching between the proprietary models and the fine-tuned open models.
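The orchestrator-plus-specialists pattern described above can be sketched as a simple router. The keywords and model names are made up for illustration; in practice the routing decision would itself be made by the frontier orchestrator model, not a keyword match.

```python
# Sketch of "frontier orchestrator + fine-tuned sub-agents": business-specific
# tasks go to specialists, everything else falls back to the general model.
# All model names here are hypothetical.

SUBAGENTS = {
    "invoice": "finetuned-invoice-model",
    "support": "finetuned-support-model",
}

def route(task: str) -> str:
    """Pick a fine-tuned specialist if one matches the task; otherwise
    fall back to the general frontier model."""
    for keyword, model in SUBAGENTS.items():
        if keyword in task.lower():
            return model
    return "frontier-general-model"

print(route("Process these invoices"))    # finetuned-invoice-model
print(route("Write a launch blog post"))  # frontier-general-model
```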
If you had told someone a year ago that there would be probably millions of people running agents locally with custom setups and MD files for various skills, they'd probably be like, "Wow, that's pretty aggressive." Do you think that in a year or two we'll be in a world where at least people on X will be talking about, "my fine-tune, I did RL on my specific problem, and my personalized agent is even better now because I did the RL"?
I mean, we see it a little bit today already, where there are people showing that you can get these models to beat any of the closed source models on sufficiently well-scoped tasks
pretty quickly.
It's not rocket science. You can basically vibe code it.
Um
You have to
be clear that you have a goal in mind. But if you can define the goal, spell it out in English, and do the same sort of prompting that everyone's doing for coding, then yeah, you can just kind of plug it in and
get to work. But a lot of it is still very much proof-of-concept or narrow research cases. Totally.
But it does seem like it's moving quickly, especially as code becomes cheap; the cheaper code gets, the more complex you can make your environments. And a year ago, well, Claude Code is about a year old; it came out, I think, in February of last year. At the time it wasn't actually that useful yet. I remember playing with it and feeling like, oh, this isn't something I want to use heavily today, because it's kind of sloppy. It's very chaotic, it just makes a mess, and I went back to Cursor for a while because it was much more controlled. But it was like, oh, this form factor feels like it could eventually work. It's the same with other form factors today, in some ways, like the OpenClaw thing, or if you saw the Gastown thing, these crazy multi-agent systems where
they aren't actually excellent yet for shipping quality production code. But the thing we had a year ago is now at the level where Claude Code, or Codex, is used for most production code by the heavy adopters. And so
it feels like it's a matter of time until these things stabilize and the
goals of having that system end up baked back into the models by the people training them. The recipes for how to train these models have become robust enough over the past year that it does seem like a good idea, in a lot of these cases, to optimize your models for the structure you want them to operate in, and if that structure is this crazy multi-agent system thing, then yeah, why not?
Yeah.
Are you expecting real, tangible breakthroughs in the first half of this year? I mean, our intern keeps saying he's close to cracking continual learning.
Oh, yeah. Continual learning is going to fall pretty quickly. I think
it'll be less of a big thing than people expect.
I mean, I think it's more of an engineering problem.
explain.
No one's actually trying.
No one's actually trying. Why not?
Like, OpenAI and Anthropic don't want to continuously train their models for each user. That's expensive and annoying and hard to serve at scale. But from a research perspective, we already do continual learning in a sense: they can just keep training the model more, and it knows more stuff because they put more of the internet into it.
Yeah, it's uneconomical right now, but yeah.
For a frontier product, I could imagine that being a selling point if you're McKinsey and you're going to a big
institution. So, yeah, hypothetically, say you're a law firm and there's some crazy case update: the model retrains on that the day the Supreme Court completely changes how the law works and everything else gets interpreted from that. Yeah,
makes a ton of sense.
Yeah. There are enough tricks. There's a lot of experimentation around exactly which recipe will be the most reliable, but we have a grab bag of six or seven tricks that work, or work in different ways, and you can mix and match them.
And
it's just going to be whatever's the best combination of these tricks; people are going to experiment and find the versions that work best. And there doesn't seem to be any big wall in sight that prevents that from being practical.
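One of the standard continual-learning "tricks" in a grab bag like that is replay: mix a fraction of old data into every new batch so the model keeps getting gradient signal on earlier capabilities and doesn't forget them. A minimal sketch, not any lab's actual recipe:

```python
# Sketch of experience replay for continual training: each batch is mostly
# new data plus a replay slice of old data. Fractions and sizes are arbitrary.

import random

def mixed_batch(new_data: list, old_data: list, batch_size: int,
                replay_frac: float = 0.3, seed: int = 0) -> list:
    """Sample a batch that is (1 - replay_frac) new examples, replay_frac old."""
    rng = random.Random(seed)
    n_old = int(batch_size * replay_frac)
    n_new = batch_size - n_old
    return rng.sample(new_data, n_new) + rng.sample(old_data, n_old)

# Old examples are negative ints, new ones non-negative, so we can count them.
batch = mixed_batch(new_data=list(range(100)), old_data=list(range(-100, 0)),
                    batch_size=10)
print(len(batch), sum(1 for x in batch if x < 0))  # 10 3
```

The replay fraction is exactly the kind of knob the "mix and match" experimentation above would tune.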
That's cool. What are you tracking on the silicon side? We were playing around with ChatJimmy.ai.
Oh, yeah. That was sick.
Crazy, right?
Jimmy's quick,
but is he smart?
Too fast.
You have to like scroll up once you get the answer.
Yeah. I was trying to see how many tokens I could get it to print so that I could actually see it go. And I was like,
give me every number between 1 and like 10,000.
But Llama just won't do that. No matter how you prompt it, it'll always stop after a few thousand tokens. Oh, interesting.
So, you can't actually get to feel it like blitzing past.
Whoa. Interesting. Yeah. It was sort of a throwback experiencing Llama 3 8B, because I remember when that model came out and there was a lot of hype, because open source developers just love open source stuff, and it was exciting and cool. It was like, wow, they really did train a big model and just put it out there. And I remember some people being like,
yeah, if you actually go talk to it, it hallucinates a fair amount; I don't know that this is actually at the frontier. It might have done okay on some benchmarks, but it's not quite there. So it was a little bit of a throwback, but you can imagine baking any of the current frontier models into silicon the same way, giving it access to tools, giving it a reasoning loop. Even if it's only 10 times as fast, that's still so much faster than, okay, got to close the app and come back in 20 minutes because my thing is running. It's going to be completely different, and I think it'll be a big step change for people who are like, "Oh yeah, AI hallucinates and I need to check its output." It'll be like, no, you can just have it right there, and it works, very fast. It's going to be a really cool moment. Will you be buying an AI lamp?
An AI lamp? I want the one that goes over your bed and folds your clothes.
Oh, okay.
Have you seen that one? It looks like a Pixar lamp.
We have
It also looks like it might dismember you if it doesn't like you. It's a little bit horrific, but I do agree: if it folds your laundry, that's pretty amazing.
I don't care if there's a one in 10,000 chance that it goes crazy and it's just
I don't care if there's like a one in ten chance of me being dismembered in the middle of the night because it got mad at me, because somebody prompt injected it or something. No. I am excited for hardware. It feels like even the first-gen hardware, the Humane AI Pin, the Rabbit R1, all that stuff, starts to get interesting with frontier models. I really hope we get a solid next iteration there, even though it's obviously very much outside of your core competency. But maybe some hardware developers will be coming to you looking to fine-tune a model, RL a model.
Yeah, if you want it local and on-device for these narrow things, like the Rabbit or whatever. And this is also Apple's strategy, it seems, because Apple likes keeping stuff on device; the whole privacy thing is part of their pitch. I think part of the reason Apple has been slow on the AI stuff is that they only ship a feature once they can do it on device with sufficient reliability. That means they're slower in rolling out features, but it means that
stuff like the summarization and the image search they can do locally now, because the hardware is good enough and the models are good enough at that scale.
Yeah. You have to imagine the same Taalas principle of baking the model down to silicon. It feels like they're doing something wafer-scale, not iPhone-scale. So maybe that's another couple years, and then you need another couple years to get to: okay, it's now frontier on a chip that's the size of your phone, fits in your phone, doesn't drain your battery. But you play that out and you get to something really, really fun and interesting.
I'm excited.
Yeah,
future is bright.
And I think, yeah, it's definitely exciting. People always said the internet was going to run out of data, but we're getting more data, and it's better data, because it's from the last generation of models.
Oh, interesting.
And so you get this flywheel: there's just more data to learn from, it's all getting better as the models get better, and you do more on top of that to boost beyond where you were with the old data. That's where the RL and the filtering come in, and the human data. It seems like you just have a pretty clear path of models getting better as you put more data into them, and we have the data.
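The flywheel Will describes hinges on the filtering step: only model-generated data that passes some check gets reused for training. A toy sketch with a made-up arithmetic verifier:

```python
# Sketch of filtering model-generated data before it re-enters training.
# The verifier here is a toy arithmetic check; real verifiers might be test
# suites, human labels, or reward models.

def verify(example: dict) -> bool:
    """Toy verifier: the generated 'answer' must equal a + b."""
    return example["answer"] == example["a"] + example["b"]

generated = [
    {"a": 2, "b": 3, "answer": 5},  # correct: kept for the next training run
    {"a": 2, "b": 3, "answer": 6},  # wrong: filtered out of the flywheel
]
training_set = [ex for ex in generated if verify(ex)]
print(len(training_set))  # 1
```

The filter is what turns "more data" into "better data": only verified outputs from the last generation feed the next one.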
Well, thank you for coming on the show and producing a bunch more data.
That's helpful. It goes onto YouTube.
It's an honor to produce data with you, an honor to join the training set with you.
Yeah, that's the goal.
That is the goal. And thank you to everyone in the chat who's also providing data for the internet.
It's God's work. Thanks for having us.
We'll talk to you soon, Will. Talk soon. Have a good one.
Let me tell you about Plaid. Plaid powers the apps you use to spend, save, borrow, and invest, securely connecting bank accounts to move money, fight fraud, and improve lending, now with AI. And speaking of data, let me tell you about Labelbox: RL environments, voice, robotics, evals, and expert human data. Labelbox is the data factory behind the world's leading AI teams. And I believe we have our next guest already in the Restream waiting room, five minutes ahead of schedule. Michelle Lee from Medra is in.