Riley Tomasek of Charlie Labs: 98% of merged code written by AI agent in August
Aug 7, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Riley Tomasek
deciding whether or not this is stagnation or hyper-intelligence takeoff. Uh, and we will be joined by our next guest, Riley from Charlie Labs.
Sorry.
Hey guys, thanks for having me.
Good to see you, Riley. How you doing?
What's happening?
I'm doing fantastic. We, uh, we've been heads down with GPT-5. Uh, and
How long have you had it? How long did you get the preview? I feel like it gets rolled out to early adopters a little bit earlier, but has it been weeks, months? How long have you had it?
A couple of weeks, two or three.
What was the first... Is Charlie liking it?
Charlie loves it. And also I love what Charlie does with it.
Yeah. What does Charlie do with it? What was the first thing you did with GPT-5?
Uh ran our evals.
Oh yeah. How'd they come back?
Just, uh, really good. Um, much better than o3, which was much better than any other model we've run before that.
Interesting. And yeah, so let's zoom out. What do you do? What do these evals measure? Um, walk me through it.
Um, so Charlie is a TypeScript-focused coding agent.
Um, that operates much more like a human does. Um, so less like an IDE, application, or terminal, and more that it joins your GitHub, Slack, and Linear workspaces. Um, and it interacts with the team the same way other humans do.
Um, and then our evals are a mix of code review, um, because part of Charlie's job is to review PRs from humans, um, as well as his own, and then, uh, code authoring, so opening PRs and pushing commits.
So when you develop your own evals, I imagine you try and keep those out of any training data. You want those to be held private. Is that correct?
Yes. And it's getting even harder with web access now because they're too good at finding things.
They're finding everything. It's funny. Um, and then talk to me about the shape of the actual problems in the eval. Are there some easy questions, some hard questions, some extremely hard questions? How are you formulating those? What's the shape of an individual task? Is it scored out of, like, a hundred? How do you think about developing a good eval?
Um, a mix of hard to very hard. The easy ones are just a waste of money and time at this point, especially with five. Like, there's a bunch that it's just not going to get wrong.
Yeah. Yeah.
Um, and then the PR ones look kind of like SWE-bench in the sense that we're taking an issue to start with. Um, but instead of giving the issue in a Docker container already, um, we trigger a comment on the issue that says, "Hey Charlie, go make a PR for this." Um, and then Charlie does his thing and then the PR comes up. Um, and then we score that PR against a whole bunch of things, like correctness against a known-correct solution, as well as, um, code quality, testability, and some softer things like descriptions.
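[Editor's sketch] The scoring step described above could look roughly like the following. This is a minimal illustration, not Charlie Labs' actual harness: the four dimension names come from the interview, but the `[0, 1]` score range, the weights, and the function name `scorePullRequest` are all assumptions made for the example.

```typescript
// Hypothetical rubric. Dimension names are from the interview;
// the weights and the [0, 1] score range are assumed for illustration.
type Dimension = "correctness" | "codeQuality" | "testability" | "description";

type RubricScores = Record<Dimension, number>;

const WEIGHTS: Record<Dimension, number> = {
  correctness: 0.5, // comparison against a known-correct solution
  codeQuality: 0.2,
  testability: 0.2,
  description: 0.1, // "softer" signal such as the PR description
};

// Combine per-dimension scores into one weighted average in [0, 1].
function scorePullRequest(scores: RubricScores): number {
  let total = 0;
  let weightSum = 0;
  for (const dim of Object.keys(WEIGHTS) as Dimension[]) {
    const s = scores[dim];
    if (s < 0 || s > 1) {
      throw new RangeError(`${dim} score must be in [0, 1]`);
    }
    total += WEIGHTS[dim] * s;
    weightSum += WEIGHTS[dim];
  }
  return total / weightSum;
}

// Made-up scores for a single PR.
const example = scorePullRequest({
  correctness: 1,
  codeQuality: 0.5,
  testability: 0.5,
  description: 1,
});
console.log(example.toFixed(2));
```

Normalizing by the sum of the weights keeps the result in `[0, 1]` even if the weights are later retuned and no longer add up to one.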
Who are the biggest, uh, customers or users for a TypeScript-focused coding agent?
Um, it's a wide range of mostly modern apps. Like, pretty much any web app these days is going to be a Next.js-type app. Um, and then all the way into back end. Like, Charlie himself is written in TypeScript.
Sure. Makes sense.
And there's very little front end.
Anything else? What else you got?
I just want to say I love the name Charlie. It's one of my favorite agent names that we've had on the show.
Yes. It's right up there with pig and what was the other one? Well, I don't think that was an agent, but uh
that was an agent, but
but yeah, it's a it's a good one.
Yeah,
congrats on locking it down.
Yeah. What about, um, cost and that side of the business? Is there any movement there, or anything where you require movement or you need movement to really unlock new capabilities in the business or new markets?
Not really for us, because we're operating kind of at a human level. Um, we do value-based pricing, so we charge per PR or per commit. Um, and because that's comparing to such expensive actions that humans are doing, the challenge for us is more actually living up to the promise than doing it cheap.
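[Editor's sketch] To make "charge per PR or per commit" concrete, here is a small illustration of usage-based billing. The prices, the `BillableEvent` shape, and the function name `invoiceCents` are hypothetical; the interview does not state actual prices or how billable events are recorded.

```typescript
// Hypothetical billable events: one entry per PR opened or commit pushed by the agent.
type BillableKind = "pr" | "commit";

interface BillableEvent {
  kind: BillableKind;
  repo: string;
}

// Assumed prices in cents, invented for this example only.
const PRICE_CENTS: Record<BillableKind, number> = {
  pr: 500,
  commit: 100,
};

// Sum the price of every billable event in the period.
function invoiceCents(events: BillableEvent[]): number {
  return events.reduce((sum, e) => sum + PRICE_CENTS[e.kind], 0);
}

const monthlyEvents: BillableEvent[] = [
  { kind: "pr", repo: "web-app" },
  { kind: "commit", repo: "web-app" },
  { kind: "commit", repo: "api" },
];
const totalCents = invoiceCents(monthlyEvents);
console.log(`Invoice total: ${(totalCents / 100).toFixed(2)}`);
```

Working in integer cents avoids floating-point rounding when many events are summed.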
Yeah. Yeah. Uh are you having
but then but then doesn't the cost reduction announced today? Isn't that great for business?
Yeah. I mean, it's good overall, but our problem is not that the models are expensive. It's that they're
I mean, they're getting really smart, but I'll always take more.
Never enough.
Like, for instance,
since the beginning of August, we've been testing um 98% of the code that got merged into our codebase was written by Charlie.
Wow.
Not 30, not 50, 98%. And that's coming through PRs. That's not like an autocomplete-in-an-IDE type thing.
That's crazy.
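[Editor's sketch] The interview does not say how the 98% figure was measured. One plausible way, assumed here, is to take the share of added lines in merged PRs whose author is the agent. The `MergedPr` shape, the login names, and the line counts below are invented for illustration and are not Charlie Labs' data.

```typescript
// Hypothetical record of a merged pull request.
interface MergedPr {
  author: string; // login of the PR author
  linesAdded: number; // lines added by the PR
}

// Percentage of merged added lines that came from PRs authored by the agent.
function agentSharePercent(prs: MergedPr[], agentLogin: string): number {
  const total = prs.reduce((n, pr) => n + pr.linesAdded, 0);
  if (total === 0) {
    return 0; // nothing merged in the period
  }
  const byAgent = prs
    .filter((pr) => pr.author === agentLogin)
    .reduce((n, pr) => n + pr.linesAdded, 0);
  return (100 * byAgent) / total;
}

// Made-up numbers chosen to mirror the figure quoted in the interview.
const share = agentSharePercent(
  [
    { author: "charlie-bot", linesAdded: 980 },
    { author: "human-dev", linesAdded: 20 },
  ],
  "charlie-bot",
);
console.log(`${share}% of merged lines came from the agent`);
```

Counting merged PRs instead of added lines would be an equally reasonable definition and could give a different number for the same repository.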
Yeah. What does that mean for, like, the future of... Who are you hiring? I imagine that you're still, you know, an engineering-heavy organization that's just puppeteering and orchestrating agents. Um, but where do you see the future of, um, software development as a career path going?
Yeah. Are, uh, new CS grads cooked?
I think if they get really good at using the AI, no. If they try and take an approach of getting really good at writing code by hand, for sure.
Yeah.
What we're mostly looking for in hiring is people who are able to see things at a much higher level and plan further out, because with tools like Charlie, you can write so much more code so quickly that it's more important to see where you're going and take the right path than it is to be able to write it quickly.
Very cool. Well, thank you so much for stopping by. Good luck with the rest of your day and, uh, congrats on an upgrade to everything that you do.
Tell Charlie to have fun out there.
Have some fun.
Thanks a lot, guys.
We'll talk to you soon.
All right. Let me tell you about numeralhq.com. Sales tax on autopilot. Spend less than five minutes per month on sales tax compliance.
Sales tax super intelligence.com.
A number of the fellas in the chat got access to five.
Break it down for us.
Reg says it's pretty good. The writing ability feels a little nerfed. Says, "The way it writes feels a little programmatic rather than sounding human, reverts to using points even for things like blog posts, and also
uses overly complicated language for simple stuff."
Uh, Techno Chief says, "It's crazy fast."
Oh, that's
Dan Ratliff says, "Yeah, I was just going to say that. Very, very, very fast." Um, Z Jean Ahmed says, "Junior devs are barbecued." Tyler, anything from your side before we talk to Guillermo from Vercel?
Um, I think maybe a good way to vibe check, at least on the timeline, is that it's almost like a 4.5 kind of thing, where it comes out and people are like, "This model totally sucks. Look at the benchmarks. It's not some massive improvement. It's like a bar, you know, not a step change at all." But then you start playing with it and it's actually like, okay, this is actually a good model. Like, a lot of the stuff I'm seeing people post is like, oh, that's actually really interesting output, stuff like that. Um, but it seems good.
Can we do the green text eval green text