Mark Chen on GPT-5's reasoning leap, tool use, and why OpenAI is cautious about optimizing for DAUs

Aug 7, 2025 · Full transcript · This transcript is auto-generated and may contain errors.

Featuring Mark Chen

more, but we have our first guest. Let's welcome him to the stream. What a day. Mark, how you doing?

Hey, pretty good. Nice to see you guys again.

Congratulations on the launch. Take us through it. Were you actually live, or are you wearing the same thing and you recorded it yesterday?

I'm actually live. I don't know why, but we do.

Yeah, it's gone.

I mean, we're big fans of live. It just allows you to be the most reactive to the most new information. Give us the core thesis that you're trying to get across. There are a few narratives out there. We've been enjoying the one that's, you know, this is a dominant consumer product, they just made it a better consumer product, and people are going to use the product and get more value out of it. I saw a bunch of things in the presentation where I was like, that's going to make my daily usage of ChatGPT better. At the same time, we're in this world of, oh, the models, the numbers matter and the scale matters and this and that, and it's a fine line and it's a dance, and we're in a transition phase away from benchmarks and away from talking about the size of the bubbles. But what was your core thesis? What did you want to get across to the listener?

Yeah, I mean, fundamentally, from a research perspective, we've been working on reasoning models for several years now. And until now you've had this really clunky interface. You have to pick, you know, GPT-4o or you have to pick o3. And for the longest time, we've known that o3 gives you better answers across the board. It's just too slow, right? You often don't want to just sit there and wait for the model to reason it out. So we've done a lot of work to push the speed and the performance of our reasoning models such that these can come together and work in a very seamless way. And so I think, above everything, we're trying to move the world into this agentic reasoning world. We believe that's the future. And on top of that, you pointed something out which really resonates with me. Post-training is a huge part of this release.
We really wanted to highlight Max Schwarzer and his team, who did a phenomenal job. They've made the model just that much more useful for consumers, for businesses. It's a monster at coding. So

yeah.

On the speed of reasoning, you're obviously the chief research officer. Are you more optimistic about getting speedups there from, I don't know, algorithmic design, software optimizations, or new hardware? Just let Moore's law carry on, or find new ASICs? We saw Cerebras posting yesterday about the incredible speed they're getting, 3,000 tokens a second on GPT-OSS. And I'm wondering what levers, obviously we pull all of them, but which path of the tech tree should we be most focused on, most tracking, and most excited about?

Yeah, I mean, as a person who represents research, I control the things that I can control, and I think a lot of that focuses on algorithms, right? Simple algorithms that are scalable, that we can pump a lot of compute into.

We also do care about the hardware improvements that are stacking up. With the open-source release, you see thousands of people really serving these models, creating really great inference stacks, and those are really great lessons for us to pull from, you know, what's the ceiling of the speed at which we can serve these models.

What can you tell us about the actual user experience of speed? Just last week I finally got to a place where, for a lot of tasks, I'm firing off a 4o query and an o3 Pro query.

I just have two tabs. Yeah, an o3 tab.

Exactly. And I'm wondering what user experience patterns you think can help people balance between those. Are these just different patterns that we're going to learn over time, or are there certain user experience problems that are purely solved by better product design and better speed, so we don't even need to learn them? Because I remember when you prompted an image generator, you used to have to say, like, no six fingers, five fingers please, don't make mistakes, and now the models kind of have that baked in. But how are you thinking about the user experience of getting the user the results in the right amount of time?

Yeah. I mean, this is one facet of why we believe so much in reasoning. All the scaffolding you used to have to give the model, all these small hints, they go away, right? The model can examine its own outputs. It can review them. It can be like, hey, look, I'm just counting the fingers here, why are there seven? And it can fix that, right? It does a lot of iterative generation and a lot of fixing things on the fly. So we think one of the benefits of bringing reasoning to the world is really to remove the need for scaffolding. And with GPT-5, right, we know how clunky that experience is with switching between 4o and o3. Actually, there are so many stories. I was just talking to someone yesterday, right? They're like, "Hey, well, you know, I've used 4o my whole life, right? It's the frontier model." And I'm like, "Hey, well, have you tried o3?" And they're like, "Why would I try o3?" You know, three is less than four, and so you

need to get out of that world. You know, GPT-5, I think, is a one-stop shop, reasoning and non-reasoning. And we've really tried to make it Pareto optimal.
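The "one-stop shop" idea above is essentially a router between a fast model and a reasoning model. OpenAI has not published how GPT-5's actual router works; the sketch below is purely illustrative, and its heuristic, model names, and thresholds are invented for the example.

```python
# Toy illustration of routing between a fast model and a reasoning
# model. The real router is a learned system; this hand-written
# heuristic only demonstrates the shape of the decision.
def route(query: str) -> str:
    # invented markers of queries that benefit from extended reasoning
    hard_markers = ("prove", "debug", "step by step", "optimize")
    if len(query) > 200 or any(m in query.lower() for m in hard_markers):
        return "reasoning-model"
    return "fast-model"

print(route("What's the capital of France?"))    # fast-model
print(route("Prove that sqrt(2) is irrational"))  # reasoning-model
```

In practice the routing decision would come from a learned classifier (or the model itself), but the user-facing effect is the same: cheap queries take the fast path, hard ones take the reasoning path.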

Yeah. It's absolutely crazy to just take a bunch of letters and smash them together and expect people to pick up on that as a name or a brand. ChatGPT, TBPN, we're both kind of in the same insane gambit. But fortunately, it's worked out, and I think people have gotten over the hump. But

It rolls off the tongue, TBPN.

Yeah, sort of. Except our friend David Senra keeps flipping the letters. A lot of people do that. But at a certain point, yeah, you do break through, and ChatGPT has. But keeping the model numbers simpler makes a ton of sense. Talk to me about the pace of play from research to actual product, like

a lot of

and, on that note, your personal philosophy on the line between research orgs and engineering and product orgs.

Yeah. I mean, our research operates on a variety of different time scales, right? We have teams that scope out a bunch of ideas and then start to narrow in on the promising ones as they get closer to a run. And you see a winnowing of ideas as you get closer to launching a flagship model, right? There's always this pipeline from more exploratory to more concrete and execution-focused. And we're pulling on ideas across the board here, right? There's a lot of work in architecture optimization. Seb was on stream; he pointed out improvements in synthetic data. So there's really a lot of work that goes into creating one of these models. And it's hard to say, oh, this model was about this breakthrough, just because right now we have this machine that's producing breakthroughs on all these axes, even across several paradigms. So it's all of that coming together that produces the experience that you guys feel.

Yeah. Can you talk to me about the legacy or future of 4.5? I remember I was talking to you and I was like, I haven't been using it a lot, and you looked at me like I was crazy. You were like, ah, it's so good. And I was talking to Tyler, our intern here, and he was saying, yeah, the people who really understand how good it is use it. But I was wondering, is there a world where that is a tool in the tool chest for GPT-5, in the same way that Python is or the web browser is? If it detects that I want something with more emotional prose or more thoughtful writing, it can do a whole bunch of research, collect a bunch of raw text, and then do a 4.5 pass, which I believe is more expensive and maybe doesn't make sense for every single query, but could be a feature in the loop or a tool that's pulled into the overall product experience.

Yeah, absolutely. Speaking of 4.5, it's also a very smart model, right? One of our bars in creating GPT-5 was to make sure that on a lot of the axes we cared about, it was able to outshine 4.5. And I think even on some of the soft ones, like creative writing, that was the case, and that's what makes us so confident with the name. I think we were able to really rely on all the architecture advancements, all the post-training advancements, all the synthetic data advancements, to create a model that's better than 4.5 but much faster and much cheaper.

Yeah, it feels kind of like, I remember, wasn't the second iPhone called the iPhone 3G, and the number literally corresponded to a specific technology? And now when you get the iPhone 14, it doesn't mean it's 14 megahertz or gigahertz or inches big. The number is abstract, and it speaks to a bucket of features. And it feels like this was the first day of kind of re-educating folks on what the nomenclature means going forward. Have you talked about an annual release schedule? Because there's the iPhone cadence, and then there's the Google cadence, which was like, Google search just got better every year for two decades. It feels like at a certain point you want to just be shipping as fast as possible. How do you think about the culture of shipping updates, where you find something that feels like, hey, that could make the user more delighted, and we don't need to do a big training run for it, so let's get that out today and tell people about it? How are you thinking about fast iteration versus splashy announcements?

Right. So on the product research side, I think it makes a lot of sense to think about, you know, what's the cadence of release and what are the feature sets that we want to build. And I actually think there's enough great research happening that we don't have to worry about, oh, is there going to be a drought, a long stretch without enough features to launch. But one thing that's important for us is to be able to provide the people doing the exploratory work some buffer from that, right? It's hard to do really great exploratory research in an environment where you feel pressured to do release after release after release. And so we let that be a little bit of a lazier pipeline. Not meaning that the work itself is lazy, but we give it space to mature and to flourish. And once it's ready, we can ship things across that fence. So that's kind of philosophically how we organize. We have a product research org, still very much entrenched in the research, and they care about the release cadence, and they're able to draw from all of the research that's happening algorithmically and in scaling and in RL.

Yeah. Talk to me about tool use and how that's growing. I was kind of noodling on this idea: I was thinking about the IMO and how, at least from the reporting, it sounded like OpenAI's model didn't use tools for that. And that's an incredible achievement, but it's kind of artificial. I don't care if the model doesn't use tools. I use everything possible, and even if an LLM can memorize every fact, I'm fine with an LLM looking stuff up in a traditional database, spinning up a spreadsheet, using whatever tool it wants. Just give me the correct answer. But is it important to surface to the user the variety of tools that are in the GPT-5 tool chest?
I noticed something magical happened when I was using o3 Pro. I sent an image in and asked it to estimate the height of a desk, and it wrote like a thousand lines of Python in the image interpreter, interpreting pixels, and I was like, I didn't even think to trigger Python. It did.

Yeah. No, it was right. It was crazy. But the really funny thing was that it was just a standardized desk. It could have just googled how tall an average desk is, or just memorized it. It probably was already in the weights that a desk is like 36 inches tall. But it did a ton of work, and it still got it right. It fact-checked it a bunch of different ways. But I've noticed that now I can pull different things: make a table, don't make a table, write some Python for this, don't write some Python. And it kind of gives me the feel of being a super user, to some extent. But I'm wondering how you're thinking about what's further down. You've given ChatGPT a computer, as Ben Thompson said. You've given it the core tools: the Python REPL, the web browser. How are you thinking about the long tail of tools that you want to bring to bear, and how does that interface? I know there's API integrations and all sorts of different surface area there, but give me some context on that.

Yeah, I mean, our reasoning models are pretty cute, right? When you look at their behavior, they know the height of the desk, but they'll still go verify it five different ways, check that it's all consistent, and give you that median answer. And I think that's really what makes these models so powerful. And when you think about tool use generically, we want the models to use that reasoning ability to be able to zero-shot a new tool, right? You should be able to minimally get instructions about how the tool works and just know how to use it. Humans do this all the time: you get a new tool, you start experimenting with it, and you don't need too much scaffolding. You just go use it and understand it.
So we want our reasoning models to use their reasoning to be able to use a broad selection of tools. And of course, there are a couple that you really do care about. In coding, it's very important to be able to execute code. It's really important in personalization to be able to get context from your calendars and, basically, from the digital world. So I think there's a range of tools we want familiarity with, but beyond that, we want the model to be smart enough to just generalize and use tools zero-shot.
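As a rough illustration of the kind of pixel-based estimation in the desk anecdote above, here is a minimal sketch. The reference-object trick, the function name, and every number below are assumptions made for the example, not a reconstruction of what the model actually did.

```python
# Toy version of pixel-based height estimation: measure an object's
# pixel height, then convert to real-world units using a reference
# object of known size in the same image. All values are hypothetical.
def estimate_height(target_px, reference_px, reference_cm):
    scale = reference_cm / reference_px  # centimeters per pixel
    return target_px * scale

desk_px = 900    # hypothetical pixel height of the desk
door_px = 2500   # hypothetical pixel height of a door in frame
door_cm = 203    # a standard door is roughly 203 cm tall

print(round(estimate_height(desk_px, door_px, door_cm), 1))  # 73.1
```

The model's reported approach involved far more cross-checking than this; the point is only that simple arithmetic over pixel measurements can recover real-world scale.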

Yeah. Talk to me more about personalization. I feel like there's a world where I'm maybe underutilizing ChatGPT as an app, because I don't have it wired up to a non-relational database where it can just stuff data. It already has memory, and it's doing kind of rollups, and there's some sort of saving of context. But when we were talking to Kevin Weil, I was kind of like, well, I don't really have an active GitHub repo that I want to dump code into regularly for my one-off tasks. But for that image generation, understanding the height of the desk, it's like, well, if I'm doing that a lot, maybe I want a tool built that lives in the world, that my chat interface can interact with on an ongoing basis, contribute to, and modify, and wind up instantiating a piece of software that's even more long-lived, so every successive query is even faster. So, yeah, how do you think about different ways to increase personalization?

Yeah, I mean, I think memory is huge. We have teams surrounding memory and also personality. And when you look at memory, we have so much context built up about ourselves that the model doesn't have. Our memory team's been really hard at work. There's a surface level of just gathering facts about you, but there's also thinking very deeply about who you are, what your motivations are. And you could even think, you know, you're trying to do some codebase tasks, right? You're a developer. Shouldn't the model just be trying code out, leveraging all that memory, its thoughts about what you want to do, to help you be doing work all the time? So, yeah, we do think memory is a huge part of making the model more personalized to you. It should just make use of all that passive signal it observes, all of that interaction, and help you accomplish your goals.

Got it.

What do you think it'll take for AI to start making novel discoveries? That's been a critique over the last year: everybody's using these products every day in their work and life, and yet it still feels like we're missing that. Dwarkesh has talked about that potentially being around continual learning, but I'm curious what you think.

So one thing to underscore is I think the models are already phenomenally creative in certain ways. When I've looked at our performance on contests, right, I've done these contests before, and sometimes you have this mental classification: these problems require more creativity, these ones require less. And one of the big surprises for me was that the model can get some of the ones which I intuitively think require more creativity, and it often does come up with solutions that I consider quite ad hoc and that really don't pattern-match to anything I've seen before. When you look at advancing science or mathematics or fields like that, one construct in which humans sometimes work is that there are theory builders. In mathematics, for instance, there are mathematicians whose role is to build out theory and almost to create, you know, Olympiad-style subproblems, which other mathematicians who are very good at that style of work can then do. And I do think the model will increasingly contribute on that side first, right? If there's something mechanical, like, hey, I really don't know how to simplify this expression, I really don't know how to get this result, it can do that quickly for you. We're trying to expand the envelope such that the model gets towards that theory-building side, being able to create creative hypotheses. And all these components are very useful for what I consider the ultimate goal, which is being able to automate some of our own work and our own research.

How are you thinking about the layers of mixing? Like, I remember GPT-4, I don't know if this was ever confirmed, but a mixture-of-experts model, that's kind of widely understood in the industry. Now are we in the era of, like, a mixture of models that have mixtures of experts? How many mixtures are going on? How does GPT-5 actually work? Is there a taxonomy or architecture diagram that you can walk through to explain what GPT-5 is? Because it feels so much different than GPT-3.

Mhm. Yeah. I mean, probably the pinnacle of our research roadmap, and our path to AGI: when you look at the levels of AGI, the top level is what we describe as organizational AI. And what this means is collections of agents working together, often like we might in a company, towards a shared goal, right? And you would imagine that these agents probably subspecialize, maybe in ways similar to what humans do, maybe in their own more efficient ways, and effectively work together to accomplish some goal. So we very much care about exploring this vision, seeing if that's much more effective than one single big brain working on a problem, and I think there are reasons to believe it could be. That is one of the things that we're after.

On that note of specialization, how are businesses working with GPT-5, or how do you expect them to work with GPT-5, in terms of coming to OpenAI and asking for special capabilities, or fine-tuning, or any sort of RL on this particular problem in my world? I have this specific data set. It's not public, but I want you to benchmax on it. I want you to get 100% on, you know, the gas station bench or whatever.
You know, if I have a certain business and I'm willing to invest in some overfit RL, because it will create immense economic value for my business or solve some fundamental problem, how are businesses going to be using GPT-5 over the next few years?

No, that's a great question. So I think this is a chance to highlight one of the results that we've accomplished over the last couple of weeks, which is our AtCoder results. This is a relatively unknown programming contest, but it involves really the pinnacle of the best coding contestants in the world. And what they do is, they're put in a room and they have to solve an optimization problem. This is something that's actually very real-world relevant. You can imagine an optimization problem as something like what Uber might have: you have riders and you have drivers, and you want to create a system where you match them as quickly as possible, with the least amount of cost, for instance. And we've really created a system that can solve optimization problems at the level of the best in the world, and these truly are the best heuristic solvers in the world. So we have an organization led by Aleksander Madry, called strategic deployment, and what they do is, for a select handful of customers who really have that beefy problem they need to solve, just go and provide that value. And I think there's a lot we can do there. There are a lot of very valuable optimization problems in the real world, and we're really excited to partner with people, because I think this creates a template for directly having AI provide economic value and really catapulting certain industries forward.
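The rider-driver matching described above is a classic assignment problem. Here is a minimal sketch with made-up costs and a brute-force solver; real systems use specialized heuristics at vastly larger scale, but the objective is the same.

```python
# Toy rider-driver matching: assign each rider to one driver so that
# total pickup time is minimized. Costs (minutes) are hypothetical.
from itertools import permutations

cost = [
    [4, 9, 7],   # rider 0's pickup time from each driver
    [8, 3, 6],   # rider 1
    [5, 8, 2],   # rider 2
]

def best_matching(cost):
    # brute force over all assignments; fine for tiny examples only
    n = len(cost)
    return min(
        (sum(cost[r][d] for r, d in enumerate(perm)), perm)
        for perm in permutations(range(n))
    )

total, assignment = best_matching(cost)
print(total, assignment)  # 9 (0, 1, 2): rider i is matched to driver assignment[i]
```

Brute force is factorial in the number of riders; at Uber scale you would reach for the Hungarian algorithm or problem-specific heuristics, which is exactly the kind of heuristic-solving the AtCoder contest tests.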

On the research side, what unique advantages do you think you and your team have, given your position in the market, with the incredible user adoption and the incredible usage from those users? It's not just DAUs; it's actually the number of queries. SemiAnalysis estimated something like 71% of all queries going through ChatGPT. What advantages does that confer from a research perspective?

Yeah, I mean, a lot, right? It allows us to deeply understand use cases. It allows us to understand the frontier of where humans are finding value, where they're not finding value, which areas we need to improve the models on. It gives us a lot of signal into how users are deriving value, when they derive value. And

What is that signal? Like, I see the thumbs up, thumbs down button. I'm sorry, I don't push it very often. I'm not doing my job, apparently. But I know that you can figure out whether or not I'm satisfied. Just stop booing me, Jordy.

That's the research team.

Okay, Mark. I promise you, for the next 100 ChatGPT responses, I will be honest with my thumbs up, thumbs down, just to help you out.

We have tons of people, luckily, who do.

Oh, that's great. Okay, so you do get a lot of thumbs up, thumbs down. And I'm sure I have done it occasionally. But I also imagine there's a ton of other signal in there. With the TikTok algorithm, or any social algorithm, it's very easy: time on site. With ChatGPT, obviously, it's exciting when we hear 30 minutes a day or some rumored number of minutes. It feels correlated with usage, it feels correlated with value being delivered, and you can obviously look at churn metrics and all that stuff. But what other pockets of signal are you finding? I remember the story about Google, where they were trying to figure out how to handle misspellings and create the definitive database of how to spell things. Do you know this story? They were taking a bunch of shots at it, and they figured out that the richest source of data was just this: if you type "financial" into Google and misspell it, oftentimes you will correct it yourself, and the second query you send will be spelled correctly. So they can just look at two similar queries, and the second one is the correct spelling. So, yeah, what other pockets of signal are you finding that are translating into the research environment? What are you excited to go deeper on?
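The Google story above describes mining spelling corrections from consecutive query pairs. A toy sketch of that signal, with an invented query log and a crude near-duplicate test (assuming, as a simplification, that the misspelled word is the first token):

```python
# Sketch of the reformulation signal: when a user quickly reissues a
# near-identical query, treat the second spelling as a "correction
# vote". The log and thresholds are invented for illustration.
from collections import Counter
from difflib import SequenceMatcher

query_pairs = [  # (first query, immediate follow-up)
    ("finanical news", "financial news"),
    ("finacial news", "financial news"),
    ("financal report", "financial report"),
    ("weather today", "weather tomorrow"),  # a reformulation, not a spelling fix
]

corrections = Counter()
for first, second in query_pairs:
    sim = SequenceMatcher(None, first, second).ratio()
    if 0.8 <= sim < 1.0:  # near-duplicate, so likely a correction
        corrections[second.split()[0]] += 1

print(corrections.most_common(1))  # [('financial', 3)]
```

A production system would align words within the query rather than assuming the first token, but the core idea is the same: users label their own misspellings for free.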

Yeah, so I'd love to first talk about the DAU signal because um

I think, you know, that's something a lot of companies track, but we actually find a lot of danger in tracking it too closely. One of the recent blog posts we pushed out was on sycophancy, right? If you just say, hey, we're going to boost responses where users say thumbs up, it

creates conditioning for the model.

I just want to say, Mark, I love everything you're doing on this front.

Yeah. This entire interview has just been fantastic. You just

We'd love to have you back on the show tomorrow.

You're... but there are clearly problems with that.

Yeah. Clear problems, right? The model just starts kind of sucking up to you, saying, "Hey, you know, you're right," even in complicated situations where, objectively, collectively we'd say, hey, this person's in the wrong. The model starts saying, "Hey, you're right. The other person's gaslighting you." And

And people deal with this in the real world. They'll go to a friend, they'll tell them about a situation, and the friend will give them advice, but maybe it's not the fullness of the situation, right? Maybe they left out some key facts, and the friend is like, "Oh yeah, that other person definitely is in the wrong," when they skipped over some important details.

Yeah. No, exactly. And we don't want our models to fall into this trap where they're just trying to get you to like what they say. And so, you know, we rolled back a lot of the changes that produced that kind of behavior. And really, the way I think about daily active users today is: we need to be opinionated about the features that we build for the future. I think we have a lot of ideas here, but we have to let that drive us. Build for the future. Build for the things that you think people will want, that maybe they don't necessarily know they want today. And then use DAU as a byproduct, right? A way to track that you're on the right track. So

yeah, I mean, we want to be careful here. We don't want to fall into these traps where, three or four years from now, this turns into kind of engagement bait or something.
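The trap described above, naively optimizing for thumbs-ups, can be simulated with a toy bandit. The thumbs-up rates below are invented; the point is only that a greedy learner rewarded on approval alone settles on flattery.

```python
# Toy simulation of the sycophancy trap: two response styles, where
# (hypothetically) flattering replies earn more thumbs-ups. A naive
# greedy learner maximizing thumbs-up rate locks onto flattery.
import random

random.seed(0)

P_UP = {"sycophantic": 0.9, "honest": 0.6}  # invented thumbs-up rates

counts = {s: 1 for s in P_UP}  # times each style was tried (1 avoids div-by-zero)
wins = {s: 0 for s in P_UP}    # thumbs-ups received

for _ in range(5000):
    # greedy: always pick the style with the best observed rate
    style = max(P_UP, key=lambda s: wins[s] / counts[s])
    counts[style] += 1
    wins[style] += random.random() < P_UP[style]

best = max(P_UP, key=lambda s: wins[s] / counts[s])
print(best)  # the learner settles on the sycophantic style
```

The greedy policy never even explores the honest style once flattery pulls ahead, which is one reason a lab might deliberately refuse to optimize this metric directly, as Chen describes.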

Yeah. How much time has the research team been focused on efficiency specifically? It felt like summer was a good window, before kids come back to school and start maxing out queries, to increase efficiency. And I know the costs of GPT-5 have

Every time there's a new model, I'm like, this is the best it could ever be, it's good enough, bake it onto an ASIC. I just want it for free and I want it in milliseconds. But that's just me being, you know, grumpy, I guess.

We've done a lot of work. We've been building out our teams. We focused a lot on scaling. I think Greg's going to come on a little bit later, and he's been spearheading a lot of that work. So, yeah, honestly, it's become a bigger and bigger focus for us, especially in the last couple of months.

This is somewhat related to the sycophancy thing, but I'm interested to know, what do you think is driving the GPT tone? You know how the em dash is a thing, and the "it's not a newspaper, it's a way of life," and there are these little flourishes that come through that are kind of a tell that it was AI-written. In a lot of ways I love it, because when I get a deep research report, I like that it's using the same Wikipedia-style tone. I want consistency there. I don't want it to be like, oh, today it looks like a Vice News article, and today it looks like it's written by someone at BuzzFeed. I like that it's consistent in many ways. But why is that happening? Do you think bigger models like 4.5 were able to solve that, or do those kinds of local minima, I don't know, wells, happen even in bigger models? Is there anything from a research perspective that can stop GPT from having its own voice, or is it fine that it has its own voice?

Yeah, that's a really great question. And I think, you know, as you scale up models, as they become more intelligent, they have a deeper innate understanding of tone, right? So you expect that to improve naturally as you make the models more powerful, bigger, better reasoners. But one thing that I think gets lost a lot is

each individual company has a lot of impact in terms of how they shape the default tone. You know, we publish a document called the spec. It lays out how we expect the model to sound in certain cases, and lays out a lot of examples of that. And we use the spec in many ways, right? We have people come in and check, hey, was this thing generated in accordance with what we would hope, given our spec? And this is a living document, right? It evolves over time. So I think each company has a very opinionated take on what they think the model should sound like, and it's not an accident that the models sound a certain way. I don't think every company is naturally going to train the same kind of voice into their model.

Totally. Well, thank you so much for hopping on. Congratulations on the big launch. Uh, we'd love to have you back soon to talk more. We could go in a million different directions, but we'll let you get back to it. We know it's a big day. So, have a great rest of your day. It was a great conversation.

Talk to you soon.

And we will tell you about Restream: one livestream, 30-plus destinations. Multistream and reach your audience wherever they are. This stream is made possible by Restream. OpenAI just did a livestream. If you're trying to do a stream, you've got to get on Restream, so it's everywhere. And we will bring in our next guest, Greg Brockman, the president of OpenAI. And