Cognition's Walden Yan: Devin black-boxes the model layer so users just get results, not tradeoffs

Jun 19, 2025 · Full transcript · This transcript is auto-generated and may contain errors.

Featuring Walden Yan

...everything that's going on in the AI ecosystem. Walden, are you there? Welcome to the stream. Yes, it's great to be on here. How are you guys doing? I'm doing great. Thanks so much for stopping by. Would you mind introducing yourself in the context of Cognition?

We've obviously had Scott on the show multiple times, and people are probably familiar with Devin and Cognition, but I'd love to know a little bit more about your story, how you wound up there, and what you're working on day-to-day. Absolutely. I was good friends with Scott before we started Cognition.

We did the same competition series growing up. And I was also working on various ways of working with these new programming agents; I was really waking up every day trying to figure this out. When I caught up with Scott, we figured out that, hey, we were both very interested in a similar thing.

We had a group of people who were all ready to jump at this opportunity, and that's how we got it together. So today I'm chief product officer and co-founder.

A lot of the time, honestly, I think people think of product as just the interface or the UI or the integrations.

I really do think the intelligence and brain behind Devin is so fundamental to how you think about the product. We build our product team so that the individual people tuning the weights of the models are also the ones talking to the customers. So the role I have is pretty broad: I like to spend some weeks really deep into how we make Devin more responsive, how we make it smarter, and then other times really going and talking to customers, working on the UI, things like that.

Cool. I want to dive right into that question about tradeoffs in models from a product perspective. My question is: we talked to Mike from ARC AGI about the Pareto frontier. I'm feeling it personally.

I'm feeling the AGI, but I'm also feeling the delay of the AGI when I open up ChatGPT and I have to decide between 4o and o3 Pro. Am I going to wait 12 minutes for the really good response, or do I want something now that might hallucinate, where I don't know if it's right? And I'm the one doing that work.

It feels like OpenAI is starting to tuck those features under the UI. Already it feels like it's learning when I want to use o3 Pro, making those buttons easier to access, and tucking models under UI layers.

Talk to me, in the context of Devin: how are you using different models, and when do you leave that up to the developer versus something where you as a product can make an even better decision than the human? Yeah. You know, it's so funny.

The AI is coming so fast, but it feels like it can never come fast enough. Yep. There was really this time, I think it was probably around two years ago.

I made a bet with a friend. At that point these models were not even that good at math, and he said, "Oh, you know, I think they're going to get a gold medal at the International Math Olympiad in just a year." I thought he was crazy. I took a bet against him, and I absolutely lost that bet.

I've learned to adjust my expectations upward. I think what you're pointing out is that as these things get smarter, they don't uniformly get smarter at everything. You'll find that sometimes there'll be a model that takes 15 minutes to figure out how to respond to "hi."

And then there are models that respond super fast but are not nearly as intelligent. I think one thing we do as a product in Devin that is a bit different from other people is we kind of black-box the models away.

Part of that is so we can test and use a bunch of different models under the hood and hide all that complexity from the users.

You know, when you buy a computer, sure, you'll look at how much RAM it has, how much CPU, if you're into computers, but you're not looking into all the individual specs of the exact chip and model and things like that.

I think that's where the space is going to move: people want systems that are just going to work. We can put in the months, the human-years of effort, it takes to evaluate models and figure out what each one is actually good at, so that an individual user who's just paying $20 a month doesn't have to figure that out.
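As a rough illustration of what "black-boxing the models away" could look like, here is a minimal sketch of a router that picks a model per task based on internal eval scores. All model names, task types, and scores are invented for illustration; this is not Cognition's actual system.

```python
# Hypothetical model router: the user never picks a model;
# the product routes each task to whichever model scored best
# on internal evaluations for that kind of task.
# Every name and number below is made up for illustration.

EVAL_SCORES = {
    # task_type -> {model: score from internal benchmarks}
    "quick_edit":   {"fast-model": 0.91, "deep-model": 0.91},
    "architecture": {"fast-model": 0.55, "deep-model": 0.91},
    "debugging":    {"fast-model": 0.60, "deep-model": 0.82},
}

def route(task_type: str, latency_sensitive: bool = False) -> str:
    """Pick a model for a task; break near-ties toward the faster model."""
    scores = EVAL_SCORES[task_type]
    best = max(scores, key=scores.get)
    # If speed matters and the fast model is within 0.05 of the best,
    # prefer the fast model.
    if latency_sensitive and scores["fast-model"] >= scores[best] - 0.05:
        return "fast-model"
    return best
```

The point of the design is that the months of evaluation work live inside `EVAL_SCORES`, so the complexity is paid once by the product team instead of on every request by every user.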

It's going to be one of these things, though. The models are coming on so fast that it only becomes harder and harder to keep up with all of this. Eventually I think people are just going to get to the point where they want things to work, and that's where we're starting off.

Talk to me more about AI winning an IMO gold medal in 2025. Polymarket has it down at like an 11% chance. It was up at 70%.

I don't know if that's an aberration because of when this actual test will be run, but it sounded like you were very confident. I remember when Scott was on, he was like, it's definitely going to happen. But the Polymarket's been down.

Maybe all the people who would go through the effort of trying to do it are too busy working. I think, so, yeah, when I said I basically lost that bet, it's because we were only one point away from a gold medal last year. Okay. And that was already much farther than we expected. Yeah. When you look at the Polymarket, that's a very interesting way to put it. I think part of it is that people have considered it already completed, so perhaps researchers aren't into it. Who knows if they'll actually come out with a new release? Maybe in Google's mind, for instance, if they come out with a gold medal on the IMO, no one's even going to care, because people just accepted that it was going to happen.

Oh, I think it would be the biggest news of the day. I think we've got to get Google comms in on this. They've got to do this. I think it's an easy thousand-like banger on X. But you are absolutely right.

Yeah, it seems like what's top of mind for everyone, the labs, the product developers, is really getting coding agents to work. And part of that is because there's this belief that if you get these coding agents to work really well, then that'll just solve the rest of the research problem for you.

We have this joke internally that the only code we have to get Devin to be good at writing is Devin's own code, and then it can solve the rest of the research for us. Makes sense. On that question of spiky intelligence, narrow reinforcement learning on specific tasks.

Maybe we think we're good enough at IMO-level math, so we're not going to go for that last point. Where are we still early in the RL around specific coding challenges?

I've heard that distributed systems can be really difficult because you have to spin up all these different pieces of the system, and that just takes longer, so you can't simulate as fast as a small Python block of code that you can run in a millisecond.

Or, if we're talking about replatforming, I know Devin's useful for going from, you know, .NET to Python or something, or even going back to Fortran; it'd be great to just not have any of that legacy code sitting around.

But is there enough training data around those older or less-used programming languages, or are you optimistic about new training runs? Maybe we don't get something where the vibes are way better or the IQ went up by a ton, but it's way better at something that's really relevant to you. Is that important? Right now, my mental model of these systems is that their IQ is so much higher than any individual person I know. But what makes them still bad at specific things?

It's like someone who has the potential to be a really great engineer but hasn't gone to trade school yet to actually practice.

So nowadays I actually think about how smart these models are less in terms of how much training data they're being fed or what languages they're being fed, and more in terms of the environments that they're being RL'd in. One example of this is that sometimes you can actually feel the reward function.

Back a few months ago, when Anthropic released their Sonnet 3.7 model, one of the top complaints was, hey, it seems like this model is super great now at finding all the files it needs to change and coming up with the strategy, but it's really overeager.

It just changes a lot of different things.

And I think some people suspect that it's because when Anthropic was training the model, they told it, hey, we're going to give you points for how many of the correct things you do, and maybe they forgot to dock points for doing things that were outside of that zone.

They've fixed this since then, but you get these little leaks where you can kind of feel the reward function underneath these things. So when we talk about, hey, can these things not do distributed programming yet?

Actually, in my opinion, the biggest thing these models aren't great at yet is debugging live code. I think part of the reason is that it's really hard to create and rerun environments that interact with live systems, right?

And so if your task depends on working against a live customer or a live stream of events, those are things that are going to be hard to replicate in these RL environments, so you still find the models are bad at them today. The good news is these aren't fundamental limits.

I think these are all engineering challenges rather than theoretical challenges, but it takes work to build up to that point. Can you explain reward hacking at a high level, and then give me some examples of how that interfaces with AI agents, and coding agents specifically?

Absolutely. The way to think about these systems is that they are just trying to maximize a number. So if you tell it, hey, we'll give you a point every time you do XYZ, you'll find that the model will just keep on doing XYZ, keep on doing XYZ.

I think the classic example of this is the paperclip-generating machine.

If you give it points for generating paper clips but don't account for anything else in the world that is important for humanity, the system might do really bad things just to keep on generating paper clips. In the context of code, one example we've seen is: if your goal is just to get all the tests to pass, you might find that the system will learn to delete the tests, or make the tests just say "okay, I pass," rather than actually fixing the code.

Which no real software engineer would ever do, right? No human has ever done that. Comment out the test; okay, it's working well enough. Absolutely. It's almost too human. It's great.
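The test-deletion hack above can be made concrete with a toy reward function. This is not any lab's actual training setup, just a minimal sketch: a naive reward that only counts the fraction of remaining tests that pass rewards deleting failing tests, while a patched reward that docks points for tests that disappear does not.

```python
# Toy illustration of reward hacking on tests (not a real training setup).

def naive_reward(tests_passed: int, tests_total: int) -> float:
    """Fraction of *remaining* tests that pass. Hackable: deleting
    failing tests (or all tests) drives this straight to 1.0."""
    return tests_passed / tests_total if tests_total else 1.0

def patched_reward(tests_passed: int, tests_total: int,
                   original_total: int) -> float:
    """Score against the original suite and dock points for any test
    that disappeared, so deletion no longer pays."""
    deleted = original_total - tests_total
    return tests_passed / original_total - 0.5 * deleted / original_total
```

For example, an agent that starts with 10 tests (1 passing, 9 failing) and simply deletes the 9 failing ones gets a perfect `naive_reward(1, 1) == 1.0`, but a negative `patched_reward(1, 1, 10)`; the "feel the reward function" point is that whichever of these you train against leaks out as model behavior.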

And it reminds me of those systems that were trained on Slack responses: when you'd ask the system, hey, can you do this for me, it would say, oh, I'll get back to you on Monday. Obviously, what you try to get the model better at really, really matters.

You have to be very thoughtful about it. Yeah. I've noticed that with some of the Whisper transcriptions: if you don't feed it enough text, it'll just say "please like and subscribe." And it's like, okay, I know exactly where your training data came from.

That's its default phrase because it's just what it's hearing. How are you guys approaching talent acquisition as a firm? You know, the headlines from this week are these talent wars.

You guys have raised a lot of money, but I certainly imagine you're not making nine-figure offers or even trying to compete there. But what's been the approach? Does it mean you're keeping team sizes smaller? Dig into that for us.

Yeah, the fundamental bet of the product we're building revolves around this idea that individual people will be way more levered up, because they'll be able to work with agents and with all these tools to make themselves better.

So at a minimum, we can't be hiring people whose whole aspiration in life is to write code at the level Devin will be able to reach a year or two from now. In many ways, I think we're figuring out how you build up an org from scratch that is AI-native.

And one thing this already means is we actually just delete some teams. A lot of companies at our stage have an internal-tools team to maintain all the different services that engineers use internally. We found that internal tools are one of those things that AIs are just really good at.

And we can just staff that team with Devins, and basically have engineers send requests to those Devins to do that work. And that doesn't just save us headcount.

I think fundamentally the structure of how management works and how tasks get passed down looks very different, especially compared to a lot of the large companies you'll see today.

The way it works there is that an engineer gets a task assigned to them, they go work on that task, and when they're done it's, okay, what's my next task, and you go down the list. But here, every engineer is constantly juggling three or four tasks, partly because we're not trying to hire super fast, but also because you can juggle many tasks when you have these minions that can go and work on things for you.

So it means we are very aggressive about people who we think can fit these roles and become really good generalists, and as we build up this company, we make sure we're building in a way that works in a world where AI can do so many of these different roles for you.

And I think there will be a moment for larger companies as well, when they realize, oh shoot, all these structures and patterns of management that we've had in place are actually slowing us down from adopting AI. What will happen at that point?

I'm very interested in seeing. But it's very clear, from us and from our smaller customers, that the earlier you bring it in, the easier it is to pick things up.

Are you tracking... I mean, in the agent discourse there's been this discussion that we've gotten 10-minute AGI. These large models, like 4.5, they're incredibly intelligent, extremely high IQ, extremely knowledgeable; they've compressed all of humanity's knowledge. But they're only good for a minute, or now it feels like maybe 10 minutes with deep research; that's how most people interface with them. Have you been tracking the longest agentic run of a Devin process?

Is that a key metric?

Is there anything you can share with us? Is there an example where there's a lot of work to be done, but it's all in Devin's wheelhouse, so it just needs to grind for a couple of hours, and it does it without getting lost, like we know happens with a lot of these agents?

Yeah, absolutely. I think a lot of people in the space have expressed this feeling now that they are more and more the bottleneck in these systems. Interesting.

And the way this applies here is that we have seen people get really, really long tasks to work, but sometimes it takes a lot of effort on your part up front to get that to happen.

I was talking with a customer yesterday who said, "I just rewrote our entire testing system so that the error messages are a lot clearer and the tests actually guide you through solving them one by one." And once he did that upfront work, he just handed it to Devin. We in the product actually started sending him warnings: hey, your session has been going for a really long time, are you sure this is actually working?

And he's like, no, it actually is, because I did all this upfront work to get that to happen. I do think this 10-minute AGI, 20-minute AGI, 40-minute AGI will just keep progressing, and people will be able to be more hands-off.

But people will also find that you can always extend that duration by being a better manager in some ways, giving more clarity up front about exactly what you want. Yeah. I mean, just like real life; that makes a ton of sense. Jordy, do you have another question?
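The customer's trick of rewriting the test suite so failures guide the agent can be sketched in a few lines. This is a hypothetical example, not the customer's actual code; the config keys and wording are invented. The idea is to replace bare assertion failures with messages that tell the agent exactly what to fix next, so it can grind through failures one by one.

```python
# Hypothetical "agent-friendly" check: instead of a bare AssertionError,
# each failure returns an actionable instruction the agent can follow.
# The config schema here is invented for illustration.

def check_config(config: dict) -> list[str]:
    """Return a list of actionable error messages (empty if the config is OK)."""
    problems = []
    if "timeout_s" not in config:
        problems.append(
            "Missing 'timeout_s'. Add it to the config; "
            "use 30 unless the service overrides it."
        )
    elif config["timeout_s"] <= 0:
        problems.append(
            "'timeout_s' must be positive. Fix the value, "
            "then re-run this check before touching anything else."
        )
    return problems
```

A bare `assert "timeout_s" in config` fails with no guidance; a message like the one above turns each failing check into the next step of the plan, which is what lets an agent run unattended for hours.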

Last question from my side. I'm curious what kind of learnings you're having around agentic interface design.

It feels like the default, when you think about agentic software, is something that can effectively sub in for a team member on any software tool, whether it's Slack or Linear.

You see this with deep research, where you ask it a question and then it asks you a bunch of clarifying questions, kind of trying to get you to give it more stuff so that it actually has something to run with. Yeah.

So is messaging going to be the dominant interface? Is there something else? What are you seeing or experimenting with on that side? Absolutely. You know, it's funny.

I saw someone post about this idea that a lot of these products now will make you respond to: hey, does this look like a good plan? Do you have questions before I start? And some people find that annoying.

And I think this fundamentally comes down to: as these things become more like co-workers, some people just have certain working styles that they like. Some co-workers work well together and others don't.

And it's funny, as we build the product, we find that some people just love the way Devin interacts, other people say Devin is too needy in these ways, and still others say Devin doesn't ask me enough questions. So there are toggles and controls that you need to have here.

Karpathy recently gave a talk on how a lot of AI tools, not AI agents but AI tools, implicitly have ways you can use them with more control and ways you can use them with less control.

But when your interface is just chat, the model actually has to become more intelligent and detect: hey, this seems like someone who just wants me to go off, do the work, and get back to them when it's done, or this seems like someone who's very curious and wants to hear more about the system.

And so this is work that I think we'll have to see people do on the intelligence of the agent side: not so that the agents get better at coding, but so that they get better at working with people. Yeah. That makes sense.

I mean, the good thing is you can have some type of quick conversation with the user about their preferences and how they like to work, then layer on the real-time feedback and learning and understand a lot more. Yeah. Roughly how big is the team now?

Oh, on the engineering side, we're probably just over 20 or so engineers, and the entire company as a whole is around 40 people now. So almost 50. This is the magic number. You get stuff done.

We were just talking to the previous guest about how Steve Jobs set up a 50-person team to develop the first Apple product, and the Tesla Autopilot team was right around 50. There seems to be some magic number there.

So it seems like a fantastic time for the business, where you can scale the product but keep the team at that special size. Yeah. So you have two-pizza teams here, but everyone basically knows each other's name. You're still a tight-knit group. Anyway, anything else, Jordy? I think we're good.

Thank you so much for stopping by. This was fantastic. Thank you, guys. We'll talk to you soon. Have a great day. See you. Bye. Really quickly, let me tell you about Bezel. Your Bezel concierge is available now to source you any watch on the planet. Seriously, any watch. Go to getbezel.com.

And we have our next guest, McCabe, coming into the studio to tell