Jeff Huber on context rot, RAG vs. long-context windows, and AI adoption curves

Jul 14, 2025 · Full transcript · This transcript is auto-generated and may contain errors.

Featuring Jeff Huber

Um, but we have our first guest coming in the stream. Welcome to the stream, Jeff, how you doing? Hey guys, great to be back. Thanks for having me. It's great to have you. For some reason, I'm missing it. Let me see. I think I lost... John's not getting audio. Oh, are we good? Are we good?

I'm not getting anything right now. No. Can you... you got me? I got you loud and clear. How are you doing? You had a busy, busy... we haven't put up a traded graphic for you and the team yet, and I'm almost surprised, but I really should publish an address.

You know, I think they keep sending it to the wrong spot, so that's okay. Yeah. Um, yeah. What's going on? What's going on? Give us the update since the last time you've been on the show.

I was checking it, and three months ago we chatted, and the question of the day was like, what's going on with Zuck, what's going on with Llama 4, where is this going? And, you know, I don't know, not sure if Zuck is a follower of the show, but I did go back and I said I wouldn't bet against Zuck at a hundred billion dollars of free cash flow per year, and I think that's panning out to be pretty true.

So yeah, that was a great prediction. How did you react? Did you catch the news this morning, that first hit Threads and then was being shared over on X as well, around the new data centers? Listen, I mean, you know, it's hard to ignore a giant blue box on top of Manhattan.

You know, it's like pretty freaking epic. The technology bros love to see it. So, it's a good graphic, right? It is. Yeah. Very great. Yeah, I thought it was a great visualization. I guess the other question is, what was your reaction to the Windsurf news?

We were chopping it up. Did you go through the emotional roller coaster? Did you never lose faith? Yeah, I'm super excited to see what Scott has to share here in a few minutes.

Um, yeah, I mean, you know, obviously all kinds of weird distortions in the market from, you know, former FTC policy. Obviously, hope that gets cleared up quickly. I think it's a big gift for Cognition. I think it makes a ton of sense; they kind of needed a horse in that race.

And, you know, while yes, agentic coding is part of the long-term play here, you can't go straight to streaming; you kind of want to start with the DVDs, you know. And so with Cognition picking up Windsurf, I think they're kind of buying that DVD business, and it's going to allow them to really leapfrog to the agent stuff in the future.

So it makes a ton of sense. Excited to see what they build. Okay.

Where we left it was the potential risk of these sort of zombie ghost-ship acqui-hire situations: if a deal happens and we get kind of the bad ending, where there's 80% of the workforce, 90% of the workforce, a couple hundred startup employees who've been working hard for maybe a year, and they don't get a good outcome and they're not happy about it.

It could have a chilling effect on startup hiring, and those folks could wind up just saying, "Hey, I'd rather just take a stable bet in big tech as opposed to a lottery ticket that, even if the numbers hit, I might not be able to cash in.

" And so my question for you: you compete with Meta for hiring people.

The question I have is: do some of these later-stage or even big tech companies themselves start using that as a recruiting tactic, to be like, look, it's very possible that this company gets acquired in the next year and it's unclear, you know... I mean, they already do: these RSUs are liquid, they'll just show up in your account, you can just cash them in. But what is your take? Do you think it makes it harder for you to pull people over from big tech into the startup world?

I think, you know, the true believers in startups, yes, they're buying a lottery ticket, they sort of understand that. But ultimately, the analogy I think of is, yeah, ships in harbor are safe, but that's not what ships are made for. And I think the best early-stage founders and hires just have to live a certain way; they kind of have to hit the high seas and see what they're made of. That being said, I think obviously as startups start to scale and they're hiring the 20th person, the 50th person, the 100th person, the 400th person, that math really starts to make a difference.

And yeah, absolutely. I totally believe that the recruiters at the big tech companies are already weaponizing this to bring people in the door. So yeah.

What about... I want to know more about how you're seeing the hyperscalers pick different parts of the stack to plug in as they build out, like, the AWS dashboard. Because I wouldn't have necessarily expected the IDE to be one of the places where AWS potentially needs a play. Microsoft Azure has Copilot now; Google, what was the one that they launched, Jules? But now they also have the Windsurf team, or some of them, and they might launch something that looks like a piece of software above the API level, not at the API level. What other categories of AI tooling, wrappers, that type of stuff, do you think big tech is looking at, or what do you think they will build internally?

Like, I feel like you have some experience here. Exactly. Where the value capture ends up landing is a big open question. I think everybody's nervous. The SaaS companies in the middle are nervous. I think the model labs are also pretty nervous about this.

You know, nobody at this point wants to have too principled of a stance. They'd rather have sort of an egg in every basket, and however it hatches, they'll come out on top. I think that, obviously, yes, coding has emerged.

It's one of the largest use cases of AI today, and it makes a lot of sense. It's a highly deterministic, highly controlled environment. It sort of makes sense that the people who are using these tools at the edge are also the ones applying them to more software.

And so it's a great use case for AI. Um, in retrospect, it was sort of obvious I think that like it was going to be a really big deal.

Where the value capture really shakes out over the long term is anyone's guess, but I think in some sense you'll never regret owning the top of the funnel, having a direct relationship with an end user, and being the default in a given category. That's Lindy; that goes back, you know, millennia, right? You want to be the default, you want to control the direct relationship with the end buyer, and you want to own the top of the funnel. And so in that way, you may be pretty bullish about your models, but does that mean you shouldn't have an IDE play? Probably not.

You probably still want an IDE play, because you're not going to bet the farm on the models winning the value capture at the end of the day. So yeah. How bright is the line between AI engineer and AI researcher right now?

Because it feels like the original narrative around a lot of the trade deals was that AI researchers can get a $500 million training run unstuck, and that can immediately unlock billions of dollars of value, save you tons of money on inference. Like, the math just works out so clearly.

I feel like we've always had this divide. I mean, to some degree we've had a divide between DevOps and product engineering, but the bigger, brighter lines have always been between engineering and sales, or engineering and operations. Is there a new dividing line between who's an AI researcher versus who's an AI engineer, and are companies gaming that yet? I always see "member of technical staff" and I don't really know what that means.

I assume if you work at OpenAI you're smart and can do a lot of stuff, but how niche is the skill set, and how much of a dividing line is that? I think the skill set is still quite different.

You know, if you're working at a large lab on a research team, you really are concerned primarily with training, whether pre-training or post-training. You're primarily looking at loss curves; you're primarily thinking about what benchmarks you're going to be exceeding at.

You're thinking about what data sources you can capture to continue to train and fine-tune on. You're experimenting with new approaches to reinforcement learning, or otherwise, to try to seek out incremental gains and beat out your competitors.

If you're an AI engineer, you're really thinking about, how do I apply GPT-4 to legal? And it's really kind of a different shape of both skills and also focus and obsession.

So, you know, that's not to say that in the limit these things won't sort of converge or won't merge to some degree, but I think today there's still a pretty bright line between the two. Yeah.

What about... I guess the question is, DeepMind seems very different from the Llama organization, in the sense that the Gemini models are truly frontier. But the problem with Google's AI strategy feels like it's on the product side, where not that many people are downloading the Gemini app. Or a lot of people are, it's hundreds of millions of people, but market share on chat apps, and who has it in their home row: ChatGPT is so successful kids just call it "chat," and Gemini is not really at that level yet. And so the question is, the problem that Google's trying to solve feels like a different problem from the problem that Facebook is trying to solve. Can you acqui-hire and acquire talent that actually solves that problem, or is there some underlying structural problem with how big Google is that is actually slowing down product development?

My fear is that you get some great product engineers in, but then you're still hamstrung by Google's scale, Google's HR department, their legal department, how they think about, oh, well, we can't launch this unless it's internationalized on day one, because we're Google.

And so all of a sudden you have these great people who have been able to go zero to one on a product, get it really viral, and create a product that people love, but they can't do that within the confines of Google. So yeah, what do you think about that?

I mean, and the potentially more specific question is: are you betting on Google plus Windsurf, or Cognition plus Windsurf, if they're going after the same developer engagement, right? If the products effectively converge. Yeah. Yeah. Yeah.

I mean, Conway's law, the idea that you ship your org chart, remains quite stable and true. And you can take that even one step back, which is that you ship your culture as an organization. Obviously, the org chart is one manifestation of your culture.

You know, I think that if Google leadership does not give this new Windsurf, or half-Windsurf, team a really long leash and a ton of autonomy, there's going to be some level of an immune response, and they're going to throw them back up again.

I think you really have to give those folks a long leash, and that has to come from the top, right?

It has to be either founder or C-suite level, where they're getting the green flag to really run hard and ignore stuff like internationalization. And maybe you still have to get the legal sign-off, but you get a dedicated team of legal people at your beck and call to unblock you whenever you need to. Yeah, it's too early to bet on the horses here, so I'm not sure I can bet on the horses yet. But even the ADA rules, the accessibility rules: when you're a startup, you move pretty quickly. You still have to do a lot of the accessibility work, but if you're like 90% there, if you mess up and the screen reader breaks on one page, you're looking at like a $5,000 settlement to just say, "Hey, sorry, we messed up.

Like, one person couldn't use this, and we feel bad, and we paid you, and we're good." But if you're Google, there's going to be an army of lawyers who are like, "How can we squeeze them on this? Get them to settle."

There's just more economic power there, and that just naturally slows things down. Yeah, it's very interesting. What are you expecting out of open source broadly, open source models broadly, in the next couple months?

We saw OpenAI delayed their launch, basically saying the takeaway was, effectively, not quite ready, need a bit more time, want to ship something that we're super proud of.

At the same time, today there was some reporting, and it's unclear if it's true yet, but reporting that says maybe Llama will shift away from open source in general. So I'm curious. Generally, I'm excited for OpenAI's open source model, but what are you thinking?

Yeah, it's a great question. I mean, China obviously continues to bring the heat on open source.

You know, regardless of what we think of the CCP and their motivations, just over the weekend the Kimi model came out, already doing incredibly well on benchmarks, and a bunch of developers and founders that I know are pretty excited about that model. And so I think that at the end of the day, if there is some belief that if you have a very good model it could sort of run away and eat an entire market, then ipso facto you're fighting against everybody else who doesn't want you to be able to do that. And so I do think that to some degree the margins will be competed away; I think that more things will be open source in the future. And I think also, if you think about the workloads: we don't use supercomputers to write our email. We just need basic at-home stuff, and it works just fine. And I think the same idea can be extended here. While the frontier models are going to be developing and figuring out new science, and hopefully helping us not die, right, we'll see, I think that for a lot of business use cases and consumer applications...

Honestly, the intelligence bar is just not that high. And the capability overhang that's already in the models today continues to be quite large as well. I think there's a lot of... people joke that there's $100 million inside your laptop; you just have to figure out how to unlock it.

Well, there's a couple billion dollars inside of these model weights, at least. And, you know, we've got to go unlock it. So, yeah. What's your temperature check on overall progress versus stagnation? Are you feeling the acceleration? Are you feeling a deceleration?

Dwarkesh last week updated his timelines, kind of pushed them out. Still very aggressive timelines. Um, but what what was your reaction to that?

It feels like I'm seeing a few memes about, like, no one knows how to scale RL. Or, you know, yes, we're seeing impressive stuff on ARC-AGI from Grok 4, but keep in mind we went from like 7% to 14%, and this is a test that kids can do.

And so it feels like everyone's saying like we need new research, we need new breakthroughs, we need new paradigms.

What's your overall temp check on timelines? Yeah, I mean, listen, I think that AI continues to be really spiky, and it's really hard to intuit where it's strong and where it's weak. You know, we want it to be strong in all the ways that human intelligence is strong; we also want it to be strong in all the ways that human intelligence is weak. We sort of project all of our hopes and fears and dreams onto these things. And I think in many cases they're still not there. Just today, Chroma released a technical report on the topic of context rot. And so the question is basically: okay, this model's got a million tokens, but what can you actually do with those million tokens?

Needle-in-a-haystack is certainly one benchmark, but we demonstrate across a sweep of other approaches that these models can actually start to fall apart really early, way earlier than you might think.

You might see reasoning performance drop off even around 10,000 tokens or earlier, and 10,000 tokens is not that much.
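To make "falling apart way earlier than you might think" concrete, here is a minimal sketch of the kind of harness such an eval might use: plant a needle sentence in filler text at several context lengths and check whether the model recalls it. The `stub_model` function here is a hypothetical stand-in for a real model call; Chroma's actual report uses real models and a sweep of harder tasks, not just substring lookup.

```python
import random

def make_haystack(n_words: int, needle: str) -> str:
    """Build ~n_words of filler text with one needle sentence planted at a random spot."""
    filler = ["lorem"] * n_words
    filler.insert(random.randrange(n_words), needle)
    return " ".join(filler)

def stub_model(prompt: str, question: str) -> str:
    # Hypothetical stand-in for an LLM call; a real eval would query the model under test.
    return "blue" if "the secret color is blue" in prompt else "unknown"

def sweep(lengths: list[int]) -> dict[int, bool]:
    """For each context length, check whether the 'model' recovers the needle."""
    needle = "the secret color is blue"
    results = {}
    for n in lengths:
        prompt = make_haystack(n, needle)
        results[n] = stub_model(prompt, "what is the secret color?") == "blue"
    return results

print(sweep([1_000, 10_000, 100_000]))  # a real model may start failing well before 1M
```

The stub always succeeds; the point of the sketch is the shape of the sweep, where a real model's accuracy curve starts bending down long before the advertised context limit.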

And so, you know, at least what I want as a builder is not a model that has a 10 million token context window but only kind of maybe sort of works sometimes. I want a model that has 60,000 tokens but is perfect at paying attention to that context and perfect at reasoning over that context.

And so, you know, we hope this report will both help developers understand what's possible to build today and also create a new set of benchmarks for the model labs to care about, because this is ultimately what people who are building care about. Okay. So, I mean, that seems like a victory lap for you.

Give me the update on Chroma and RAG, because, just to kind of set the base case: there was a lot of, oh, RAG's going to get one-shotted by huge context windows. I saw the million tokens and thought, that's a lot of stuff.

I feel like I could just stuff everything in there. And then I was talking to Dwarkesh about, can we go even larger?

Can we get to a billion-token context window? And he was making some of those points, but he was also just saying, the inefficiency: it scales quadratically, so getting to a billion tokens is not a thousand-x harder, it's more like a million-x harder, or something like that.
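The quadratic point can be made concrete with a back-of-the-envelope calculation: in naive full self-attention, every token attends to every other token, so compute grows with the square of the context length. This sketch ignores real-world factors like per-token FLOP constants, KV caching, and sparse or linear attention variants; it only illustrates the scaling law being referenced.

```python
def attention_cost(n_tokens: int) -> int:
    """Relative cost of full self-attention over n tokens: each of the
    n tokens attends to all n tokens, so cost grows as n^2 (unit-less)."""
    return n_tokens * n_tokens

cost_1m = attention_cost(1_000_000)      # ~1M-token context
cost_1b = attention_cost(1_000_000_000)  # ~1B-token context

# 1,000x more tokens -> 1,000,000x more attention compute.
print(cost_1b // cost_1m)  # 1000000
```

So under pure quadratic scaling, the jump from a million to a billion tokens costs a million times more attention compute, not merely a thousand times more.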

But give me the update on RAG as a research effort, what Chroma is doing in terms of a product, and then some of the actual use cases: when this works, where this breaks down, how it fits into the overall puzzle. Yeah, the big question has been: is long context all you need, or not?

And, you know, maybe it will be someday. I don't have a time machine; I haven't been to the future. But I don't care about someday.

I care about now, and I care about the community of developers that we serve, and giving them good information so they can really reason about: what is this stuff, what is it useful for, how can I mix it together to create value? That's what we really care about.

And so, I mean, the research was not really commercially motivated, right? I think no good research is commercially motivated. You have to go in with just a question and a thesis, and you'll just see what happens. You can't go into it with a bias. Yeah.

And I think really the bias from us came from intuition from building stuff. We noticed that once models start crossing certain thresholds, stuff really starts falling apart. And if you talk to anybody building in the space, they'll tell you this firsthand.

This is their intuition as well. Yeah, a million tokens, maybe, but it really starts to fall apart after a certain number of tokens. And so we wanted to really try to quantify that, reason about that, and put numbers around that. And so we start to think about: what are the solutions there?

Well, one solution certainly is to use retrieval. Obviously, you know, we are biased in that way. We think Chroma is a great tool for that. We're also seeing increasingly people use reranking in the loop.
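The core retrieval idea is: instead of stuffing everything into the context window, rank documents by similarity to the query and pass only the top few to the model. Here is a toy sketch of that idea; in practice you would use a vector database like Chroma with a learned embedding model, whereas the bag-of-words "embedding" below is just a stand-in for illustration.

```python
import math
from collections import Counter

def embed(text: str) -> Counter:
    # Toy bag-of-words "embedding"; real systems use a learned embedding model.
    return Counter(text.lower().split())

def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def retrieve(query: str, docs: list[str], k: int = 1) -> list[str]:
    """Rank docs by similarity to the query; only the top k go in the context window."""
    q = embed(query)
    ranked = sorted(docs, key=lambda d: cosine(q, embed(d)), reverse=True)
    return ranked[:k]

docs = [
    "refund policy: customers may return items within 30 days",
    "shipping times vary by region and carrier",
]
print(retrieve("how do I return an item for a refund", docs))
```

Reranking, which he mentions next, is typically a second pass: a more expensive model re-scores the retriever's shortlist before the final cut is handed to the LLM.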

We're also increasingly seeing people take a large context, say 10,000 or 100,000 tokens, and break it into many small LLM calls, because the smaller the context, the better the model can reason.

And so that process of doing sort of a map-reduce, splitting it into many pieces and then having a huge army of parallel, small, fast, cheap LLMs process that context, is also, I think, emerging as a best practice. But, you know, it's still early.
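The map-reduce pattern he describes might look something like this sketch: split the big context into chunks, fan the chunks out to many cheap calls in parallel, then merge the partial results. `call_small_llm` is a hypothetical stand-in for a real small-model call; here it just filters lines so the example runs deterministically.

```python
from concurrent.futures import ThreadPoolExecutor

def call_small_llm(chunk: str) -> str:
    # Stand-in for a real small, fast, cheap LLM call; here we just
    # "extract" the lines mentioning ERROR from the chunk.
    return "\n".join(line for line in chunk.splitlines() if "ERROR" in line)

def chunk_text(text: str, max_lines: int = 2) -> list[str]:
    """Split a large context into small chunks (here, by line count)."""
    lines = text.splitlines()
    return ["\n".join(lines[i:i + max_lines]) for i in range(0, len(lines), max_lines)]

def map_reduce(text: str) -> str:
    chunks = chunk_text(text)
    with ThreadPoolExecutor() as pool:          # "map": many parallel small calls
        partials = list(pool.map(call_small_llm, chunks))
    return "\n".join(p for p in partials if p)  # "reduce": merge partial results

log = "ok\nERROR disk full\nok\nERROR timeout"
print(map_reduce(log))  # ERROR disk full / ERROR timeout
```

The reduce step can itself be an LLM call that synthesizes the partial answers; the key property is that each call sees a context small enough to reason over reliably.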

I think we didn't really want to go into the technical report with, here are all the answers. We don't know all the answers. We wanted to motivate: here's the problem as we see it. Yeah.

Well, speaking of other research and technical reports, what was your interpretation of the METR report that suggested that coding tools like Cursor and Windsurf were actually potentially slowing down a set of developers?

It wasn't a huge study, but it seemed like a shocking result, in the sense that people were expecting roughly a 20% increase and they got roughly a 20% decrease. And one of the developers who was actually in the study chimed in as well and was like, I think I know what's going on.

I'm leaning on this tool too much. I'm doing a lot of waiting for stuff, and so maybe some of this just naturally gets cured by speed. But I'd love your take on that report, if you read it. I mean, we as a society are going to figure this out the hard way, right?

We're going to figure out how much of our reasoning we should externalize to AI, and how much of that reasoning we then have to integrate to process what it thinks, right? And there was also news over the weekend about, you know, OpenAI making you dumber, or ChatGPT making you dumber, headlines like that. And of course, with the exact claims you always want to dig under the hood and look at the actual study design and their actual claims versus the salacious headline. But I think it just speaks to how early all this stuff is. Obviously there's a ton of excitement; people want to change the world.

And, you know, exactly how it gets integrated into our workflows over the medium and long term: it's the classic thing where over the next year it probably won't change that much, but over the next 10 years it's going to look totally different.

And so, yeah, we'll just wait 10 years and see. But yeah. Yeah. Talk about condensing down information, in the context of RAG specifically, but also the other techniques for this.

What I'm interested in is: Andrej Karpathy recently was kind of thinking through what the future of reinforcement learning could look like, and he draws on this very interesting story that I think most people will be familiar with. For a long time, if you went to any LLM and asked how many instances of the letter R are in the word "strawberry," it would get confused and give you the wrong number. And this was because of tokenization: each piece of the word is broken down, so the model can't really see the letters.

And so the solution was that they kind of baked in some sort of system prompt, apparently, that said: if a user asks you to count letters, split the word into individual letters and iterate through it one at a time. And they just kind of figured out the cheat code for that particular type of problem.
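The workaround amounts to forcing the model to operate on characters instead of tokens. A plain-Python rendering of the iterate-one-letter-at-a-time procedure the prompt describes:

```python
def count_letter(word: str, letter: str) -> int:
    """Count occurrences of a letter by walking the word one character
    at a time, the way the system prompt instructs the model to."""
    count = 0
    for ch in word:       # iterate over individual characters,
        if ch == letter:  # not tokens, which may merge several letters
            count += 1
    return count

print(count_letter("strawberry", "r"))  # 3
```

A tokenizer might split "strawberry" into pieces like "straw" and "berry", so the model never directly sees the three separate r's; spelling the word out character by character sidesteps that.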

The problem, of course, is that humanity's job is unbounded, and there are a billion different versions of that that come up all the time in the context of work. That's where you have all these little edge-case failure modes, and that's probably why I still have to book my own flights, or why you need a human to book your flights: there are just too many edge cases.

So his solution was: we need to develop a reinforcement learning paradigm that goes and finds those hard-won lessons and bakes them down and saves them. And that feels like something like compressing information, because you're not going to want to stuff all of that into the context window again.

You're going to run into that problem. So talk to me about your reaction to Karpathy's post, and then how RAG might interface with the next models, whether on the training side or on the inference side. Yeah, for sure. Sure.

I mean, I think learnings from the system running can be integrated both into the data systems that underwrite that system and into the weights. Now, whether that's through RL or different means is going to be dependent on the use case.

One way that we've been thinking about it is what we're calling the inner loop and the outer loop of context engineering. So if your context needs to be tuned to do a good job, context rot is real; it motivates the need for context engineering.

And this is the job that many application developers have been doing now for years: solving that problem of what should be in the context window right now. Mhm. That's the inner loop. The outer loop is: how do you design a system that will get better at the job of the inner loop over time?

So how do you capture signal from your users? How do you capture feedback from agents? How do you capture all that data? How do you analyze all that data? And then how do you bring that data back into that inner loop problem?

That could be, obviously, number one, massaging and changing the knowledge and data your model has access to, also known as retrieval-augmented generation. Or it could be applying it to weights in different stages of the pipeline, including the LLM itself, the embedding model, rerankers, all of the above.
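One way to picture the inner/outer loop split is a sketch like the following, where the `ContextEngine` class and its additive scoring scheme are hypothetical illustrations of the concept, not Chroma's actual design: the inner loop decides what goes into the context window right now, and the outer loop folds user and agent feedback back into the data the inner loop draws from.

```python
class ContextEngine:
    """Toy sketch: inner loop picks context; outer loop folds feedback back in."""

    def __init__(self, docs: list[str]):
        # A learned "usefulness" score per document, updated by feedback.
        self.scores = {d: 0.0 for d in docs}

    def inner_loop(self, k: int = 1) -> list[str]:
        # Inner loop: choose what should be in the context window right now.
        return sorted(self.scores, key=self.scores.get, reverse=True)[:k]

    def outer_loop(self, doc: str, helpful: bool) -> None:
        # Outer loop: capture signal from users/agents so the inner loop improves.
        self.scores[doc] += 1.0 if helpful else -1.0

engine = ContextEngine(["doc A", "doc B"])
engine.outer_loop("doc B", helpful=True)  # a user signals doc B was useful
print(engine.inner_loop())                # ['doc B']
```

In the framing above, the same feedback could instead flow into weights (the LLM, the embedding model, the rerankers); the score table is just the simplest place to show the loop closing.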

And so, I mean, we just sort of put this concept out like a week ago.

We gave a short kind of conference at Ramp's office in New York, on context engineering: Context Engineering NYC. Listen, you've got to do it. And this is the first time I've really seen this posited in such a clear way. So, okay, we don't have all the answers, but I think this is the question people will come to ask in the future: not only how do we build a system which works today, but how do we build a system that gets better over time, automatically? And that is the outer loop of context engineering. Makes sense.

I want to get how you're thinking about Grok. Last week was an intense week, from everything from the fiasco, I think it was on Tuesday, with the Grok bot just going wild. Maybe it brought more attention to the launch Wednesday, which was extremely impressive.

But then later in the week, people were realizing that the reasoning model, or part of the reasoning loop, was kind of making sure that it understood the boss man's views, by, you know, searching for what Elon was saying about the coverage and then wrongly reinforcing that. And so as I went into the weekend, I was thinking it would be interesting to see how Grok can do in the enterprise, and then this morning they announced a $200 million deal with the government.

So, I think that's a good stamp of approval. But I'm curious what insights you have around some of the larger potential customers, and how they're even thinking about model selection for different products and applications.

Yeah, you know, I think you could probably wait. I think, again, Ramp has that graph of, you know, Anthropic and OpenAI revenue growth. Let's wait to see when Grok makes a dent in that. That's not to say they will or they won't. I don't know.

You know, there are these headwinds that you're talking about, which you could see as either a feature or a bug, depending on which way you look at it. I mean, there are things to like, and there are things to not like. It is kind of hard to predict the future here.

I guess what I want to know is: what is the flip side? Maybe we should just talk to Ara at Ramp about this, but what's the flip side of that chart? We know who the sellers are: OpenAI, Anthropic, Gemini, and then Grok, and there's kind of a long tail. But who are the buyers?

Because I see the individual prosumer, the knowledge-retriever, "I would pay $200 for a better Google" types. I'm one of them. Then there are the developers who are paying for developer tools. But then what is the third bucket of buyers for tokens right now?

It seems like it's a lot of companies with kind of point solutions: I'm stuffing an LLM in my enterprise SaaS, speeding something up, cleaning up some data. But do you have any insight into what's the third wave that's coming that everyone listening to this should start investing in now?

No, I'm just kidding.

I think you're right that while a lot of the spend is consumers, a lot of the spend is also businesses. Enterprises are using it for really boring stuff: process automation, broadly. You should expect those buyers to be way more conservative about the tool they pick. The adage of "you never got fired for hiring IBM" definitely also applies here, and so reasoning about who is the IBM of this space, and why, will help you determine where that spend is going to go.

Yeah, that's a great question, because from a being-on-the-frontier-as-a-brand view I would say OpenAI, but then from a scalable-infrastructure view I would say GCP and Gemini, and then from a model-agnosticism view I would say Microsoft. It's a tough question there.

I don't know if that IBM has emerged. Maybe you have a strong opinion. Maybe Jordy does. But overall, I mean, great catching up with you as always. Yeah. Anything else, Jordy? No, congratulations on the launch of the paper. Yeah, very exciting results.

So, thanks for funding it. Good to see you guys. Have a great rest. We'll talk to you soon. Cheers. Take care. We need to tell you about Wander. Find your happy place. Find your happy place.

Book a Wander with inspiring views, hotel-grade amenities, dreamy beds, top-tier cleaning, and 24/7 concierge service. It's a vacation home, but better, folks. And if you're looking to take a vacation to the Goodwood Festival of Speed, you might run into Dua Lipa. She says, "I love speed." I love it.

She's really a driver. Like, this is not a larp. She wasn't driving the car, to be honest, but she's into the car. She's not just hanging out, and the livery was absolutely fantastic. Just to give everyone some context: we have Scott Wu from Cognition and Jeff Wang from Windsurf joining any minute now.

The studio crew is getting ready for everybody. We