Alex Albert on Claude Opus 4.1: drop-in upgrade with agentic reasoning gains and same pricing
Aug 5, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Alex Albert
...she would be able to help make CBS relevant again even if she was continuing to run the Free Press. Anyway, without further ado, we have Alex Albert. Alex, welcome to the stream. Thank you so much for joining on short notice. Great to meet you. Great to have you. How are you doing?
Great to meet you guys. Very excited to be on here. And I'm hoping you hold on to that mallet so we can get a gong ring in. Give us some news. What have you got? Yeah, it's... Oh, go ahead. No, you go for it. Yeah, I was going to say, it's an exciting day. We dropped Claude Opus 4.1 just about three hours ago. You've got to give us a number. How many parameters? How much money did you spend on the thing? I can give you a SWE-bench score. Okay, yeah, give me a bench score. What is it? Congratulations. Thank you. Thank you. It's been a very, very exciting week. We're shipping like crazy.
A lot of stuff happening. I don't know how you guys are even keeping up over here. Oh, it's amazing. Yeah, I mean, we've been talking about streaming for seven hours a day. We'll get there eventually.
Fortunately, three hours is just barely enough to scratch the surface of what's going on in tech. Talk to me about where this fits into the rest of the landscape.
There's a ton of stuff going on in Anthropic's world, from Claude Code to different consumer models; there are so many different projects. How does this fit in? Yeah.
So this is our best model yet for general capabilities and intelligence. It really is pushing the frontier on all sorts of agentic reasoning and coding tasks.
Back in May, with the launch of Claude 4, we kind of hinted that we want to get on a more continual release schedule with our upcoming models, and this is our first step toward that vision of being able to ship models faster and faster to our customers and end users. Yeah.
One of the narratives that's stuck around with Claude for a long time is this idea that the vibes are just really good: yeah, it does great on the benchmarks, but it's also just fun to talk to.
People like the way it talks, but at the same time, Anthropic's been fantastically successful in the business context. Is there an important overlap there? Are you hearing from business customers that the vibes being good is driving bottom-line results? Well, yeah.
More so, how much do developers care about vibes versus just raw performance? It seems like developers are the only ones who can tell that the vibes are good, but I'm always interested in how that actually translates to my bottom line if I'm an enterprise SaaS company and I need to use the Anthropic API for something.
Am I happy with the vibes, or is that just kind of a nice-to-have on top? Yeah, I mean, you can measure vibes in a lot of different areas, right? If you're a developer, you're measuring vibes in the coding domain: how long of a task can these models complete?
Can you just let it go off agentically and work on your files? If you're an enterprise creating an application that has end users, and you're having those end users actually interact with the model, you want those personality vibes to be pretty stellar.
Otherwise, you're just going to give a bad experience to your customer. I think Claude has a really wide range there. And of course, we're known for that natural-feeling personality, and we've done a lot of character work there.
We have some really great researchers on the team who focus on that. But then, on the other side of the spectrum, for the developers, for those folks that are building agents, that's where we get into that other type of vibe, which is: how long of a task horizon can these models operate on?
Is there a good benchmark for task horizon yet? We've heard baby AGI, 15-minute AGI, 20-minute AGI.
We've seen a lot of people string these things together, but in terms of most consumer interactions, where most people hit the interface of an LLM, it feels like 20 minutes is kind of the upper bound of what the average AI user is experiencing.
Can you talk to me about that? Yeah, I do think LLMs right now have somewhat of a jagged intelligence frontier, where in some tasks, of course, they spike more than others. For me personally, when I'm trying to assess that capability in a model, there are a couple of places I look.
One is coding benchmarks, especially. These are really important because often it's the model trying to complete a PR end to end. So it needs to take a lot of actions, call a lot of tools, and modify a lot of files.
So things like SWE-bench are actually good proxies, to some degree, for measuring that. There are also third-party institutions doing this research right now.
One is called METR (Model Evaluation and Threat Research), and they have that really nice line-goes-up graph plotting out the models and how long of a task they can do compared to a human developer. Sure.
And Claude has always been pushing the frontier on that, and I'm assuming once they get 4.1 up there, it'll be up and to the right as well.
Yeah, I mean, in terms of the spiky intelligence, ARC-AGI is the main thing people point to as a weird valley of capability, because it seems like it should be obvious and easy to do, and yet it reveals something about the models.
Are there any places in the coding world, more specific ones, where models are underperforming, just because nobody has focused on them? I imagine that most models are better at Python than Fortran, but is that an outdated take? Is Fortran now just kind of solved because somebody got all the training data together and did the RL, and now we're good?
Yeah, I definitely think there's still a gap. If you were to take SWE-bench, for example, and translate it into every coding language, I'm sure Python and JavaScript/TypeScript would be totally ahead of the other languages.
But we are seeing, especially with the Claude 4 models, that they're able to handle a lot more of those older languages, or the languages that don't appear as much in the training data, to a much better degree.
And that's through a combination of things, from RL to general architecture improvements. So we're covering the bases with the models, but yeah, it definitely started in one place and is now expanding out. What have reactions been to 4.1?
I'm assuming a variety of people got access to it pre-release. What are the areas people are most excited about? Yeah, again, it's those coding and agentic tasks that have really been the star of the show here.
What I think is interesting about that is that coding is kind of the proxy that enables everything else. If you think about how a model actually interacts with things on a computer, everything can be done through coding operations.
So when we say coding, it's almost a little disingenuous to narrow it to just that domain. But some of our early customers that were playing around with 4.1, Windsurf being a great example, showed in one of their benchmarks that the jump from Opus 4 to 4.1 was roughly the same performance jump as from Sonnet 3.7 to Sonnet 4. And this was a real-world junior-developer task eval. So it's something I would definitely pay attention to. How are things going on the rate-limit side?
We've heard that capacity is generally constrained for everyone, everywhere, all the time, because we don't have enough data centers. Obviously, people are working on that. But I guess, give us a broad update on how things are going, and then more specifically: when you're talking to companies that are building on top of Anthropic's APIs, are there specific strategies you recommend to embrace the realities of the modern AI era and its capacity constraints?
Yeah, I mean, the rate-limit situation is something we're just continuing to iterate on, and of course we have very, very smart people, much smarter than I am, thinking about this problem, in terms of building out compute and also making our inference overall more efficient.
Generally, we want to provide the best experience we can to our customers.
So whether that's in Claude Code or through the API, there are various things we do, whether it's our applied AI team helping our enterprise customers with their specific deployments, or, in Claude Code, making sure that our rate-limiting system is fair and actually allows people to utilize it as much as they can, without rewarding people who might be abusing the system to the detriment of other users.
So it's really about monitoring and keeping it consistent, so we ensure great experiences for everyone trying to use Claude. What about pricing? Is there a step up in cost associated with 4.1 versus Opus 4? What does that look like? Pricing is the exact same.
We really do intend for this model to just be a drop-in replacement for Opus 4. So if you're using Opus 4, you should just be able to cleanly switch over the model string and you're good to go. That's awesome. Yeah. What are some of the weird use cases that you've seen for Claude Code?
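[Editor's note: the "switch the model string" upgrade described above is, in code, a one-line change. A minimal sketch, assuming the general shape of Anthropic's Messages API request body; the model ID strings shown are illustrative assumptions, not taken from the transcript.]

```python
# Sketch: upgrading Opus 4 -> Opus 4.1 by swapping only the model string.
# Model IDs below are assumed for illustration.

OPUS_4 = "claude-opus-4-20250514"       # assumed Opus 4 ID
OPUS_4_1 = "claude-opus-4-1-20250805"   # assumed Opus 4.1 ID

def build_request(model: str, prompt: str) -> dict:
    """Build a Messages-API-style request body; only `model` varies."""
    return {
        "model": model,
        "max_tokens": 1024,
        "messages": [{"role": "user", "content": prompt}],
    }

before = build_request(OPUS_4, "Review this PR.")
after = build_request(OPUS_4_1, "Review this PR.")

# Everything except the model string is unchanged: a drop-in upgrade.
assert {k: v for k, v in before.items() if k != "model"} == \
       {k: v for k, v in after.items() if k != "model"}
```

Since pricing and the request shape stay the same, no other client code needs to change.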
I've talked to some people who say that because coding is kind of an abstraction over everything a computer can do, they're coming to Claude Code with tasks that feel more like general personal-assistant agent stuff, and then it just writes a bunch of code. Is that real? Or is it more like when people make those beautiful artworks in Excel, where they color every single cell like a pixel? You're kind of abusing the tool, but it's still impressive. Or is it actually, you know, a glimpse into the future?
I think that is a glimpse into the future. There was a tweet I posted a few days ago asking the community, hey, what are the best non-coding use cases you're using Claude Code for? And it got hundreds of replies, and the answers were really super wide-ranging.
Everything from managing your own personal knowledge base, a second brain on your computer (maybe you have a ton of notes and text files), to some folks now actually using it for video production.
So we have a guy on our team who uses Claude Code with an open-source library to create animations and things for all our product launches, which is just absolutely amazing. I didn't even know that was possible.
But it is hinting at this thing: if you can control code, you can really control anything that happens on a computer. Yeah. So now we're moving into almost an operating-system abstraction. And I think that's really the direction you're going to start to see Claude Code heading in.
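[Editor's note: the "code as an abstraction over the computer" idea is easy to make concrete. A personal-assistant request like "tidy this folder" becomes a short throwaway script; below is a hypothetical sketch of the kind of script an agent might emit. The function name and filing rule are invented for illustration.]

```python
import shutil
from pathlib import Path

def tidy(folder: Path) -> dict:
    """Move each file into a subfolder named after its extension.

    Returns a mapping of filename -> destination subfolder, so the
    'assistant' can report back what it did.
    """
    moved = {}
    # sorted() materializes the listing first, so subfolders created
    # during the loop are not themselves iterated over.
    for f in sorted(folder.iterdir()):
        if f.is_file():
            ext = f.suffix.lstrip(".").lower() or "misc"
            dest = folder / ext
            dest.mkdir(exist_ok=True)
            shutil.move(str(f), str(dest / f.name))
            moved[f.name] = ext
    return moved
```

The point is not this particular script: it's that an agent that can write and run code can perform arbitrary file, network, or system tasks through the same interface.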
That's awesome. Anything else, Jordy? I'm curious to get an update on... it seems like a lot of the companies we talk to are in on this. We were talking with Dylan Field the day of the IPO, and he was talking about updates they're making with MCP. What's going on in that realm?
Any updates while you're here on that front, or interesting use cases? Yeah, I mean, MCP's been absolutely ripping lately.
The community is awesome, and it's really gotten to this point now where it's almost its own self-sustaining thing, where there are meetups and gatherings being hosted completely outside of our facilitation, and thankfully we get to participate in them. That's just been super cool to see.
I think MCP is going to be a large part of making sure this agentic future goes well.
So if you can spin up easy integrations and connections to any service, product, or internal application, all of a sudden you unhobble Claude to some degree, and you give it access to the knowledge it needs to actually go out and do the correct operations.
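[Editor's note: to make "spin up easy integrations" concrete, MCP clients typically register servers through a small config entry. A hypothetical example following the common MCP client config convention; the server name, command, and path are invented for illustration.]

```json
{
  "mcpServers": {
    "notes": {
      "command": "python",
      "args": ["/path/to/notes_server.py"]
    }
  }
}
```

Once registered, the client launches the server and exposes its tools and resources to the model, which is the "unhobbling" described above.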
So I'm very excited about the team there, and yeah, hopefully a lot more good stuff coming soon as well. Awesome. Very cool. Well, congratulations to the whole team on the launch. It's a massive day, massive week, and we'd love to have you back on soon. I'm sure 4.2 will be right around the corner. Oh yeah, a lot more coming soon. Yeah, stay tuned. Have a good one. Cheers. Bye. Bye. Up next, we have Scott Kupor from Andreessen Horowitz. He was in the OPM business, the other people's money business. Now he's the director of OPM.