Standard Intelligence trains computer-use models on 30fps screen capture, demos self-driving with 50 minutes of data
Feb 24, 2026 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Devansh Pandey
Maybe it turns into a spray and pray fund. You never know. Maybe Leopold says, "Yeah, I'm just going to write five million dollar checks to every company." Who knows? Anyway, we have our next guest in the Restream waiting room. We've got Standard Intelligence. How are you doing?
What's going on?
Hey, fresh off the launch. Doing pretty well. How about you?
We're doing fantastically. Thank you so much for taking the time to come on the show. Since this is your first time on, I'd love an introduction to yourself and the company.
Yeah. So I'm a co-founder of Standard Intelligence. We pre-train computer-use models, basically. The thing people are doing is training on screenshots and chain-of-thought traces, and we're asking: what if you train purely on 30 fps video?

What actually goes into the training data? There's a lot you can do on a computer, and I feel like if you've never trained on Ableton and it comes up randomly, are you actually going to be able to learn Ableton from just playing in Premiere Pro and Word and Paint or Photoshop or whatever? How are you thinking about the transfer, and what's actually in the training set? Actually, just zoom out and talk about the process in more depth.
Yeah. So we have two splits of data. There's a small contractor split: we made an app that people run on their computer, and it records their screen and logs all their key presses and all their mouse movements, and we're running that all the time. And then we also have a much, much larger unlabeled dataset of basically every video of computer use we could find on the internet that we're allowed to use. We trained a model on the small contractor split to label that big set. The goal is to train on all of it and end up with a general model that can generalize to basically anything you could do on a computer.
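The two-stage pipeline described above (a small, directly logged contractor split trains an action labeler, which then pseudo-labels the much larger unlabeled video set) can be sketched roughly like this. This is a toy illustration, not SI's actual code; the `Action` shape and the labeler interface are assumptions:

```python
from dataclasses import dataclass
from typing import Callable, Iterable, List, Optional, Tuple

@dataclass
class Action:
    """A low-level input event inferred for one frame transition."""
    key: Optional[str] = None                # e.g. "ctrl+s", or None
    mouse: Optional[Tuple[int, int]] = None  # cursor position, or None

def pseudo_label(
    frames: Iterable[bytes],
    inverse_dynamics: Callable[[bytes, bytes], Action],
) -> List[Tuple[bytes, Action]]:
    """Label unlabeled screen-capture video with inferred actions.

    The inverse-dynamics model looks at consecutive frame pairs and
    predicts which input event (key press / mouse movement) caused the
    transition, the same signal the contractor split logs directly.
    """
    frame_list = list(frames)
    return [
        (nxt, inverse_dynamics(prev, nxt))
        for prev, nxt in zip(frame_list, frame_list[1:])
    ]

# Stub labeler so the data flow is runnable: pretend every
# transition was a mouse move encoding the frame sizes.
stub = lambda a, b: Action(mouse=(len(a), len(b)))
pairs = pseudo_label([b"f0", b"f12", b"f345"], stub)  # 3 frames -> 2 labeled pairs
```

On real data the inverse-dynamics model would be a trained network consuming pixel frames; the stub here just makes the two-stage data flow visible.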
What kind of limitations do you have on who can install your software to capture that data? If I'm a company, I feel like I'd have to have a pretty high degree of trust in you guys to let my employees run it. Is that more of a partnership? How does that work?
So right now it's us: we're recording our own screens all the time, plus we have some number of contractors, and we get to pay them somewhat less because they're not doing active work for us. It's more like passive screen recording.
Sure.
Yeah. And then, are you sitting on top of some sort of foundation-model brain for reasoning chains and the LLM piece of the puzzle, or is this a model that kind of lives on its own? You're not, right now?
We're not at all. The model that we released, or I suppose demoed, is entirely trained on this kind of 30 fps video in, and typing and mouse movements out.
Okay. So, how much is that again?
How much longer will I have to fill out forms on the internet? I should try to estimate how many times I've entered the same information over and over and over. And John Collison talked about this on our show today, and on his own show, I think with Ben Thompson: at what point can you just take a link and say, "Hey, please buy this"?

Yeah.
And then it just does it for you? Like, pretty soon. I think that kind of use case is under six months away, depending on exactly what you mean.
Yeah. In terms of actual deployment, I imagine this is something I'd personally want as more of a tool that's called from a consumer LLM app. Am I thinking about that correctly, or will you jump straight to consumer?

Yeah, I think in the short term the kinds of people that are particularly cool to sell to are mechanical engineers doing CAD, where they can press the tab button the way software engineers press tab in Cursor and have their next minute or two of manual work done. We showed that in the gear-extrusion demo, where you have this gear and you're extruding faces, which is a very, very common thing you do in CAD. And I think there's a more general thing where you can think of computer use as a tool call, or you can think of it as just the thing that you do for knowledge work. And I think
we're just in a place where we can scale computer use on its own. It's not impossible that we'll initialize from an LLM, or use text training to make the model smarter in text space so it can fill out forms better. But it is not the goal of the company that people have Claude call this as a tool. The goal is for it to just use your computer, or its own computer, in general.

Talk about your experiments with self-driving. Does that work potentially apply to robotics more generally?

Yeah. So I think this general pre-training thing, labeling a bunch of unsupervised data with actions and then training on that labeled data, this inverse-dynamics thing, I expect to transfer very well to robotics. Self-driving in particular: Neil, who works at SI, was like, okay, we have this action model, and his friend had a comma. There's a comma joystick mode where you can control the steering with the arrow keys, so we figured, if it's a general computer-use model, surely it should be able to control a car, because that's just a thing you do on a computer. It's video in, you see it on the screen, and you press the left and right arrow keys to steer. We originally didn't really expect this to work, and then it just worked much better than expected. We fine-tuned on, like, an hour of data.
It's a good sign.
On how many hours? Three hours?
One hour. Less than one hour, actually: 50 minutes.
That's crazy.
And the system is able to just fully navigate around SF?
I mean, sorry, navigate around South Park. It's not a general self-driving model. I would not recommend sitting in this car and just letting it do whatever it wants. But yeah, it's pretty cool.
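The arrow-key steering trick described above boils down to discretizing a continuous steering correction into key taps. A minimal sketch, assuming a joystick mode where each left/right arrow tap nudges the steering command by a fixed amount (the `deadzone` and `step` values are made up for illustration):

```python
from typing import List

def steering_to_keys(angle_deg: float, deadzone: float = 1.0, step: float = 2.0) -> List[str]:
    """Map a desired steering correction (degrees) to arrow-key presses.

    If each tap of the left/right arrow nudges the steering command by
    `step` degrees, a continuous correction becomes a short burst of
    discrete key events. Corrections inside `deadzone` are ignored so
    the model doesn't jitter the wheel on tiny errors.
    """
    if abs(angle_deg) < deadzone:
        return []
    key = "left" if angle_deg < 0 else "right"
    taps = round(abs(angle_deg) / step)
    return [key] * max(taps, 1)
```

This is the illustrative inverse of what the model learned: its training data pairs on-screen video with the key events a human (or here, a driver) produced, so "drive" reduces to emitting the right key sequence.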
Si take the wheel. Don't make mistakes.
Take the wheel.
We are not a Tesla competitor. We are not a comma competitor.
Do you think the sport coat is the next "it" apparel item in tech? Because it looks fantastic. The chat loves your sport coat, and I just feel like it could be the middle ground between Wall Street and San Francisco.
You look fantastic.
Um, yeah, I don't know. I really like this one. I got it from Bonobos in Union Square.
There you go.
I think I think I like dressing up like at least a little bit and it's it's fun.
That's good. Okay, back to the business. Uh,
Yeah. I want to know what you're learning from previous product launches, since it feels like you're training a very generalized model. We had the ChatGPT moment: people were just chatting with ChatGPT back and forth, then they started using it as a Google replacement, and that kicked off the whole "Google's cooked" narrative. Then the Studio Ghibli moment was really the launch of a better diffusion model with some reasoning in it, and people decided it was a Studio Ghibli creator; they found that niche where it's really good at creating cartoons, even if it's not quite style transfer. How much do you want to just turn a wild open model loose and hope someone finds the killer app, versus knowing this is going to kill in CAD and launching, like, Cursor for CAD on day one and going from there?

Yeah, I think the answer is some combination. Short term, CAD and design work somewhat generally are things the current models just totally can't do; LLMs, or anything in an LLM harness, are really, really bad at CAD, for example. So that seems like a case where we know what to do: we can just scale up this model. We have a bunch of Blender data, a bunch of 3D-modeling data in general, and we can scale up CAD. And then I'm also quite excited to release a more general tab model for people to play around with and figure out what it's particularly good at. So when I'm asked what the commercialization plans are, we have a reasonable idea of what the first steps are, but there could just be this massive thing once people start playing with it at scale.
So you're training on video frames, 30 fps video, correct?
Yeah. I was told by an anonymous poster on X by the name of Roon that text, in fact, is the universal interface. Was I lied to?
Yes.
Whoa, shots fired. Explain. Elaborate. Why doesn't this just collapse down to text? Why don't I puppeteer CAD from text? How does this all play together?
Okay. At some point in the arbitrarily long future, if we only use text models, we could force most things into text. But I think there are just a lot of things that are much more native when done through computer use. GUIs are designed for humans to use. We have this massive long tail of things on the internet that are entirely undoable by LLMs. For example, when I do ML engineering, most of my time is spent doing the grunt work of engineering: a lot of looking at graphs, analyzing graphs, comparing loss curves. You can do this in text, but it's a much larger pain than doing it in the native interface, which is video. There's a reason humans don't interact with a computer purely through text; it would kind of suck. For example, video has the concept of time in a way that text doesn't.
Mhm.
Speak for yourself. I've got green text right here, black background going.
If this can eliminate YouTube tutorials for software, that's a killer app.

Yeah. Is this... is that anything?
It's not just going to eliminate the tutorials. It's going to eliminate the whole process, because you don't need the tutorial if you can just say, go do the thing that I need you to do. But yeah, I mean, you're obviously in that phase for a while.
How are you thinking about uh go to market in general for the underlying technology?
Yeah. So as I said, there are the short-term CAD and design use cases. There's the tab model, where we want to give anyone a general version of that: in Cursor you press tab and it completes your next edit or whatever. What if you could press tab and it completes the next five, then ten, then sixty seconds of what you would do on your computer? And then longer term, we're training a general model that is able to do useful work, and you'll be able to send it off with a prompt to do that work. There's also a very interesting thing about the data we're training on: it has a bunch of error correction built into it. When you have a lot of data of humans doing things, a lot of the time the humans make mistakes and then have to correct those mistakes. You don't get that with text, because with most text on the internet you don't get to see the process of messing up and then fixing it. So I expect the model to have a strong native prior for doing that self-correction properly.

Mhm.
So you can get it to go do something for ten minutes, and it'll try something for two minutes, then mess up slightly, but it knows how to fix that over and over again until it's gotten to a solved state.
Yep.
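That self-correcting rollout amounts to a simple loop: act, observe, and keep going until the task reaches a solved state, with the mistake-fixing learned into the policy rather than bolted on. A toy sketch, where the policy and environment interfaces are hypothetical:

```python
from typing import Callable, Tuple

def run_until_solved(
    act: Callable[[str], str],        # policy: state -> action (hypothetical)
    step: Callable[[str, str], str],  # environment: (state, action) -> new state
    solved: Callable[[str], bool],
    state: str,
    max_steps: int = 100,
) -> Tuple[str, int]:
    """Act until the task reaches a solved state.

    Because the training data contains humans making and then fixing
    mistakes, the recovery behavior is expected to live inside the
    policy itself rather than in an external retry harness.
    """
    for i in range(max_steps):
        if solved(state):
            return state, i
        state = step(state, act(state))
    return state, max_steps

# Toy run: "type" a target string, recovering from a wrong first character.
target = "abc"
act = lambda s: "backspace" if (s and not target.startswith(s)) else target[len(s)]
step = lambda s, a: s[:-1] if a == "backspace" else s + a
final, steps = run_until_solved(act, step, lambda s: s == target, "x")
```

In the toy run the policy starts from a wrong character, backspaces it, then types the target, recovering from the error inside the same rollout.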
Very cool. Well, congratulations on the launch and thanks for taking the time to come chat with us.
Thank you for sport coating.
We need some sport coats around the office. You need something in between. You've got John over here formal, I'm doing casual Friday on a Tuesday, but a sport coat is perfectly in the middle. It was great to meet you, and come back on soon.
Thank you.
We'll talk to you soon. Cheers.
Goodbye.
Well,
back to the timeline.
Back to the timeline.
Uh there was an individual
who accidentally gained control of 7,000 DJI vacuums. He was just vibe coding.
Amazing. And, according to Investment Hulk, he accidentally found the CCP backdoor and could control it with a gaming controller,
and then he just got control of everything.
That's so crazy.
This is why I've been deeply concerned with letting any foreign adversary flood our country with a bunch of robots. I think we should avoid it.
I thought this was funny earlier. Hegseth says he'll order random pizzas to throw off the monitoring app.
Oh yeah,
I expected something like this to happen. It's kind of silly that everyone has a dashboard up and can tell when things might be getting a little more tense in the Pentagon. So yeah, give them a budget of, you know, a few hundred thousand a year and just order pizzas at random times.
For the record, this is a joke, and he is joking. But I do think they could throw it off, potentially. You never know.
HubSpot acquired Starter Story.
Yeah, this is very exciting.
Very cool.
Yeah. Starter Story, an overnight success. What, a decade he's been doing this?
I believe Pat the founder had just posted.
Yeah,
He posted something. He said HubSpot should acquire Starter Story, and then, like, two weeks later it was done.
Wait, really? Oh, I thought that was from like years ago.
Oh, maybe it was. I thought he posted that a long time ago, and then... Yeah, he said
No, he said September 23, 2025. So
Oh, wow. Not long. Yeah, couple months ago.
Couple months.
HubSpot should acquire Starter Story. The SEO ship is sinking. In my opinion, HubSpot needs to pivot way harder to video, specifically YouTube.
Yeah,
I'm biased, but acquiring Starter Story would take their YouTube game to the next level.
And he was quoting Brian, the co-founder, saying, "Dear founders, it's a good time to sell your company. Love, Brian."
Anyways, there's a bunch more stuff in here, but we will get to it tomorrow.
Tomorrow. Arena Mag is out. Go check out issue number seven. They're on Substack now: arenamagazine.substack.com. Go check it out. And
one more post for you.
Deep Dish Enjoyer says, "I don't see what the point of shoveling snow is when AI agents are going to commoditize burrito taxi services by 2028."

It's a good excuse.
Leave us five stars on Apple Podcast and Spotify. Subscribe to our newsletter at tbpn.com.
Have the best evening of your entire life. We love you.
Goodbye.
Nice work, brothers. I'll see you on the next one.