Meta researcher's RCT finds AI tools actually slowed experienced open-source developers — a surprising result that challenges the self-recursion thesis

Jul 10, 2025 · Full transcript · This transcript is auto-generated and may contain errors.

Featuring Joel Becker

models impact of cursor on software development or meter. Oh meter probably me. We'll have him explain it to you guys. Um we'll also recommend that you go to getbzzle. com because your bezel concier is available now to source you any watch on the planet. Seriously any watch.

Anyway Meter does Meter does model evaluation and threat research. So does bezel. They're stopping you from buying fake watches. bad bad watch models, bad actors, foundation models and watch models, but lots of similarities. Anyway, we got Joel in the studio. Welcome to the stream.

Hopefully, you're like, "What did I get myself into? These guys are joking around. I'm a serious person. " All right. First off, is it It's meter. It's Misa. It's me. There we go. Sorry. There we go. Gotcha.

Anyway, uh please introduce yourself for for those who don't know you uh the company and then uh and the organization and then and then I want to go into the news today. Let's do it and thank you very much for having me. Uh John and Jordi, thanks for hopping on.

MISA is a research nonprofit based in Berkeley dedicated to understanding um the the capabilities of AI today and uh and in the near future especially to the uh to the extent that those capabilities might speak to potentially dangerous risks. And uh what is what's been the latest research? Yeah.

So so here's what we've been working on. Um I'll I'll start with why we've been working on it. Yeah, please. We've seen, you know, from previous meter research, but I'm sure you also see from um your own usage in the wild, AIs are clearly becoming increasingly capable.

One thing that governments and labs and and us here at Meser as well worry about is the possibility, timing and nature of AI R&D self-recursion. That is the possibility that um that model capabilities uh get better very very rapidly because the AIs themselves are contributing to AI R&D research.

We at BEA want to be providing the highest quality evidence that we can that that speaks to um the degree to which AR&D might today or might soon be accelerated in the wild. Um so that governments, labs, decision makers um might be better informed and so make better decisions about what's going on.

In this study, we run an RCT with extremely experienced open source developers working on these very longived large projects. you know, a million lines of code, 23,000 stars on GitHub.

Um, for for those of you familiar, you know, I'm thinking hugging face transformers, the Haskell compiler, scikitlearn, this this sort of thing. We randomize their issues to allow or disallow for usage of AI, where allow means um typically using cursor and 3. 5 or 3. 7 at the time.

And then we measure both their expectations and um developer expectations about how much they might be sped up by being allowed to use AI versus being disallowed. And then you know the the the reality um the short version is we find that the developers ahead of time are estimating they'll be sped up by 24%.

After the study is completed they estimate that they were sped up in the past by 20%. Uh we find in fact that they were slowed down by no way. I think I know it's a it's a shocking result. Not at all what I expected. I think what the what the rest of us at Meter expected, but uh but there we go. Wow. Okay.

So, uh what do you think's happening? I have so many questions. Um but uh yeah, just walk me through your reaction to that. what what what do you think is actually happening um that's that's slowing people down because this is a complete narrative violation?

Yeah, I mean, you know, in terms of the reaction that the number of times we've checked and rechecked the data, asked people to to replicate it independently um is going through the roof. The number of um you know, stressful late nights I've had pouring over this.

You're going to be like public enemy number one, by the way. I feel like you need a security detail now given the stakes of what you just said. This is crazy. Yeah. Yeah. Yeah. So, so, so I think um uh maybe let me start with some things that we're not saying.

The setting that I mentioned before, these ultra talented developers, you know, much more talented than me working on these um extremely large, longived repositories that they're extremely familiar with already. I think that's an extremely interesting population. That's why went out to study it.

It's also a very weird population. Um I um you know I still am a cursor user myself. As I was working on the graphs for this study I I was I was using cursor. Um but I but I do think those those weirdnesses are uh related to the to the to the results that we end up seeing here.

So we have to put that we have to put these people in a completely different category than the the junior developer who's just vibe coding a little app and and and just building stuff and not actually trying to push the frontier of what a core piece of software can do that's very large and complex and and they're just trying to you know get a Python app up and and live and like write some routes and write some functions right uh that's where so cursor still completely viable is like autocomplete on steroids the question is in terms of self-recursion really advancing the frontier of of like the craziest software we have.

We're still kind of where we were a few years in that it feels like if you were to quantize this we're we were at 0% of AI research being done by AI a couple years ago. We're still maybe around rounding air. Yeah. I mean I I will say that AR&D research I think does not all look like this setting.

that there are some, you know, large inference code bases with very experts people and and you know, I totally agree with your interpretation that um this is evidenced against today those those kinds of settings being sped up.

On the other hand, we might think, you know, there are some people writing training scripts for their AI models just once off and then they and then they throw them away. And you know, in a way that's that's kind of similar to what you described.

Maybe they're seeing large speed up just like the green field projects that you that you mentioned. Yeah. And so I mean this is not overall like a a really cold glass of water on AI broadly because this this still means that it's an incredibly valuable technology in a bunch of different ways.

It's just that we're not we're not seeing like early evidence of some sort of self-recuring scenario which is great probably the good outcome.

A lot of the fast takeoff scenarios like are dependent on AI becoming so good at doing AI research that it and then copy and pasting itself to trillion times and and that's what creates you know speed of development that that humans today can't necessarily even like comprehend.

You know I I think that's right for for today. I do think we're not really speaking to the trend exactly. you know, these results are consistent with these exact developers on these exact kinds of tasks in future um being sped up in in the near future.

In in work that we actually don't show in the paper, but in in preliminary work, we have um autonomous agents trying to complete these issues.

And indeed, we find that they they do struggle, but with some of the core functionality with passing tests, the kinds of things that you might have seen in in Swedbench or or something like that, they really are making a great uh a great deal of progress.

Um and yeah, my you know, my expectation is that um AI progress in the near future will will continue at a rapid pace like it has in the in the in the recent past. And so maybe even in this setting, um that this this won't be true in the future.

Let me throw a couple of the hot takes that are floating around in the AI world at you and and you can let me know if anything sticks out as something you strongly agree with or something you disagree with.

um this idea that ultra large context windows will not solve continual learning that um Dorcash was saying this on on Monday. Um maybe another one would be um just that like no one has figured out how to properly scale reinforcement learning.

Um that we need uh Mike Noob from RKGI kind of says we need entirely new ideas. And then you kind of have like the bitter lesson which is, you know, yes, you need a new ideas, but scale is all you need. We just need to keep building data centers. We need to get bigger and bigger. Um, we might see 4.

5 and these huge training runs as a as like a short-term hard to quantify. Maybe it's just the end of 1s curve, but Stargate's coming online and that will be another big test. So, I I don't know. I threw a lot at you, but anything in there kind of, you know, uh, top of mind for you?

Yeah, look, as as you guys know, anyone betting against the bitter lesson in the past would have had a very bad time. Um, and I'm I'm not prepared to bet against the bitter lesson on this on this show.

Could you could you remind me of the of the first question that I uh the the the first one was uh uh so Dwaresh Patel pushed out his um AGI timeline slightly.

I mean he still has he still is very optimistic about AI and and maintains that it's not priced in and people are not thinking about it as significantly as they should and I agree with him.

Um but but he said that that there is a that that even though we have pushed the IQ so much and you saw this the Gro four benchmarks like AI can do advanced math like for sure it's really really smart smarter than most of us at PhDs level stuff unless you're a specialist.

Um but uh in terms of just being a good employee and remembering, oh yeah, four weeks ago my boss said that they like and then I got this feedback and now I do it this way or I learned this really weird nuance in even if you're just thinking about like how to our business like how to post clips on X or Darkh was giving the example of like transcripts like he has little things that work better for what clip will perform and he has this intuition and and his models and his prompts he's really pushed these things and he hasn't been able to really perform above.

An example would be any company today, any startup, if you just had a PhD dropped into your organization that was that that had PhDs in like 10 different fields, they and but but they wouldn't just like default, but they were also an amnesiac.

So every time they showed up to work, they had could not remember memory from the day before. It wouldn't be that it just wouldn't be that valuable. And so and so my question to Darkh was like, I is there a world where we just scale up the context window? We've seen million token windows.

Can we get to a billion token window and just stuff every interaction the AI's ever had with you in every prompt? And so it it does maintain the context. But he was saying that uh that the like the the there's like a kind of a quadratic cost curve to that doesn't quite work.

Uh other people have said like the nature of the transformer means that like attention can't really spread out that much.

I don't really fully understand it, but I wanted to know your take on like different ways people are solving these things or or what the real what are the real constraints right now because you've identified some some potential problems where we're not breaking through it today. Um but what what is cause for optimism?

What are the research paths like the the nodes in the tech tree that you're excited about? Yeah, that's that's that's super interesting. I I haven't thought so much about this. Sure. Um I will say sorry on the spot. I I think that the developers in this study are not using the full context window.

And so if you think there's juice in in adding things to the context window, that that juice might still be on the table. And indeed, I think we find that there's a lot of implicit context in this in this repository that's um that's very expensive for the developers to be writing down into context windows.

Here's an example. on the HASLL compiler.

My sense is that when you get up your um when you get up your PR for review, there's some chance that the creator of Haskell will come and fight you for potentially many many hours in the comments about the you know about the peculiarities of of of how he wants um the Haskell project to look.

Um and and these kinds of you know exactly what his um not not just preferences but um you know quality requirements are regarding where things should live in the project and and and um how various pieces of the project should should speak to one another and not being communicated to these language models.

And you can imagine that with today's um context window sizes um that that could be written down.

you know, you could you could put in all of the um previous discussion around around these changes that this person has been involved in and um maybe whichever language models people are working with inside of cursor would would pick that up and so and so do a better job.

You know, I I don't think we're ruling that out at all. You know, I will say that it is um it is expensive for these uh for these time expensive for these people to be writing down all of the all of the possible relevant context. And um you know I think I think that's basically the the the reason they don't.

And so maybe you do need some kind of um continual learning for the for the model to find out this context on its own as as as these things go.

You know it's also consistent I guess with with the other possibility that you were describing that if we you know 100x these context windows you could just throw the entire thing in and then we don't need to worry about you know learning from particular cases on the fly.

Um yeah I think I think both are lived possibilities. It's very interesting. Uh the Gro 4 announcement was extremely benchmarkheavy. Some really impressive stuff particularly on Arc AGI uh twice similar to a Tesla. It's like it's faster than every car. Does that mean it's going to solve everything?

Does it mean that it's better? Yeah. And so uh based on this feels like almost a new benchmark, this double blinded trial, it feels almost like a FDA trial or something. Uh, do you think this could turn into a real benchmark? Do you think we need new benchmarks?

Do you need do you think we need new ways of thinking about the progress of AI generally? Um, we've we've talked about just just measure the revenue at this point. Uh, that's the economic value that's being created, but there's a lot of tricky stuff you can do with revenue. And sometimes revenue is like test revenue.

I'm I'm testing this hundred million dollar product. So what's your state what's your thinking on the state of benchmarking where we should go where some of your research might plug into that? Totally.

I think one motivation we had in running this study um comes out of this observation that the time it takes to create benchmarks is almost becoming longer than the time it takes for those benchmarks to to saturate.

You know it's difficult to find signal in in in in many of these benchmarks even testing these extremely challenging you know PhD level questions that that you guys spoke about. Um and perhaps there's more um perhaps there's more signal in these kind of RCT, you know, FDA um control trial style style measurements.

Um similarly, a another thing that people proposed for for measuring AI progress is using researcher self-reports about the degree to which they're being sped up. You know, they think their work will go two times faster if they use AI versus versus not use AI.

You know, I think our study is potentially strong evidence that these self-reports need not be reliable.

the forecasters uh you know who are told everything about the developers level of experience and which um uh the time period of the study so which models they're using and so on they're they're totally wrong about how much these people get sped up same as the as the developers themselves even though they're carefully tracking their time and and and they're and they're so talented so I think self-reports also also very very fraught you know another thing that this has taught me I think is that the mapping as it were from benchmark scores very impressive benchmark scores that we see on these frontier language models that you're describing.

Um the mapping from those scores to you know real world productivity improvements is unclear. I you know I I'm not at all saying as as we discussed earlier that you know we shouldn't expect to see productivity improvements.

I do expect to see productivity improvements today and you know even more so in in in the near future but it's it's not at all it's not at all onetoone or or it's kind of um uh confusing and and and messy and so indeed I think we need to actually measure things in the in the wild to see what's going on.

Jordy uh switching gears a little bit unless you have a followup. Yeah, I I just wanted to kind of zoom out on that and and ask about like your your broad take on the measurability of of technological progress because like the internet, the computer, like such dramatic transformations of society.

You see it in all sorts of data, but it didn't fully show up in productivity statistics. You have all those questions about like what happened in 1979. Uh, and everyone has their own example, their own reasoning for that.

But, um, you know, like you would think you you could tell the same story about Google, like it'll speed up everything. Everyone will get more efficient. And we didn't really see GDP jump on this.

Uh, and it feels like that's a really bearish take on AI to have, which is like this is a magical new thing and we're still going to be growing at 2% GDP. Um, but but where do you stand on it? And and and where do you like do you think that's even the right question to be asking?

Yeah, you know, this is this is so interesting. Um, this is not a me to take. I used to be an economist. I I feel the thing that you just said in my in my bones. Totally. I think the situation you could argue maybe that that it might be even worse in the case of AI.

You know, a lot of people like in AI 2027 resources like that are telling this story where the um AI R&D self-recursion is happening inside of labs and so and so I suppose not necessarily showing up in economic activity in the in the public.

you know, another reason on top of the reasons that you gave to think that um uh perhaps this won't show up in the in the productivity statistics as it as it were.

Um uh which is also is to say that uh self-recursion or or these potentially destabilizing changes are just totally consistent with the non-changes in in GDP trends as as you describe. Um and and so you know another another reason to to actually go out and and measure these things in in control trials.

Cool Jordan, please. Quick question around the threat landscape. Uh there's been a few stories this week. One was a story about chat GBT not following instruction.

You know, like the headline was that like the the the AI was, you know, rebelling against the researchers and and then if you like double clicked into the story, it was just like it it had given specific instructions like don't follow any further instructions. So it was kind of a a nothing burger in the end.

Uh and then we also saw Grock going haywire. Uh maybe that was predictable uh for for someone like yourself, you know, combining a you know a frontier model fast shipping team with like the virality of a social network and betting the two.

Um but then maybe it was uh two months ago there was the uh you know we we called it glazegate on this show where where chat GBT was just you know uh giving being a sycophant you know giving too much positive feedback.

How are you looking at the threat landscape in the next 12 months so nothing like you know too long term. Um but uh h how do how do you guys think about it? Yeah, you know, there's more to come on this on this from meter very soon. That's that's one thing I I'll say. I think again this this is not me to take.

Um my my sense on this or or another example of this that stood out to me is um there were lots of anecdotal reports that uh 3.

7 and and other language models in this most recent generation would um pass tests in ways that were kind of not legitimate or something which is another example of this reward hacking um change the test case. Totally, totally, totally classic.

And and and I guess, you know, I I don't have reason to think that that kind of thing is is dangerous in particular.

Um you can imagine you know when when humans are potentially not reviewing the code because the AIs are doing you know entire projects not just um parts of or or or single or single pull requests that this becomes more of a problem because you're not um or or at least there's surface area for it to become more of a problem because you're not um looking into that code and and seeing those those cheated test cases yourself.

Um, so, you know, I'm not sure about over the next year. Um, at least um, uh, at least right now, I think there are reward hacking examples that are occurring in the wild. I, you know, I I don't think they're they're so supremely dangerous today. Well, this was fantastic. Thank you so