Rhoda AI emerges from stealth with $450M Series A and a robot foundation model trained on hundreds of millions of videos
Mar 10, 2026 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Jagdeep Singh
The only thing faster than the AI market? Your business on MongoDB. Don't just build AI, own the platform that powers it. And without further ado, we will bring in Jag Singh from Rhoda AI. How are you doing? Welcome to the show.
Well, thanks. How are you?
Thank you so much for taking the time to join us. Great. Since this is your first time on the show, please introduce yourself and the company a little bit.
Yeah. So, I'm Jagdeep Singh. I'm one of the co-founders of Rhoda AI, and we are building generalist, intelligent robot foundation models to solve real problems in manufacturing and logistics.
Okay. Help me understand what I should be thinking of when I think of a robot. I'm familiar with the KUKA robotic arm, the Optimus humanoid, the Roomba. There are so many different ways to think about robotics. The Tesla Model 3 is in many ways a robot. Where do you see the first robot coming online?
Yeah, great question. So we've had robots for a long time. Traditional robots have been around for decades, and they're exactly what you described: KUKA, robots like that, that are basically designed to move through a predefined trajectory that's programmed. They can do one thing really well, over and over again, repeatably.
What they can't do is deal with variability, right? They don't learn by themselves from data. There's a new class of robots people are working on in Silicon Valley. It's kind of a hot thing, as you've probably heard. Yeah,
these are robots that have a neural network capability and can learn from data.
You feed them a lot of data of robots moving through certain trajectories, and they can learn for themselves how to perform those tasks. The problem is, those approaches use what's called a VLA, a vision-language-action model. I don't want to bore you guys with the details, but
We love the details. So go ahead, spill the beans. Well, yeah. So with these VLA models, you've probably seen robot demos on the internet where they're doing cool things like making coffee or folding a t-shirt. The problem is, all these demos are just that. They're demos. They work well in a lab setting,
but they fail if you move the model into the real world. Yeah.
And the question is, why do they fail in the real world? Well, because these models are trained on relatively small data sets, because you don't have internet-scale data sets of robotics trajectories. So people teleoperate robots, basically puppeteering robots around to do certain tasks, and you collect a number of trajectories that way, and then you can train the model to do the task. But because the quantity of data is so small, they can only work well if the test set is very similar in its distribution to the data set the model was trained on. And the problem is, you can do that in a lab setting. But when you take it to the real world, there's a much broader diversity of settings, of configurations of objects, lighting, and so on, and the models fail. So that's the central problem that we started out to address: how do you get robots to generalize beyond these very contrived situations that people have shown they work in in lab settings? And so we've got a different approach that we're taking.

Okay. Push back against the teleop strategy, because I was totally on board with the no-teleop thing during the Tesla boom, but then Waymo seemed to do a lot of teleop and it seemed to sort of work. And so when I think about ways to get a lot of data, it doesn't seem like the craziest thing, in a world where we have a bunch of Scale AI-style, you know, RLHF teams that are manually curating answers to questions for LLMs. The human in the loop for a medium amount of time seems to be a tried-and-true path. Why doesn't teleop make sense in this particular industry?
Great question. I'm glad you asked it. So I think self-driving cars are a bit of a special case, because the car is basically a robot that is very easy to teleoperate. You sit in it, and it basically has four actuators: left, right, speed up, and slow down. And we've all been driving cars for a long time, and you can easily put in millions of miles in a car, which is what Waymo had to do to learn how to self-drive. They did collect a lot of data, but even there, they don't have all the data they need. There are a lot of so-called corner cases that those cars run into that cause issues. In the case of robotics, the problem is much more serious, because now you're talking about manipulation. You're not just operating a robot in a single environment like a flat road. You're dealing with the full dexterity of a human hand, like 20 degrees of freedom per hand. Every object's different, every type of task is different, right? And these things become very difficult to teleoperate. The teleoperation process for these: you've got to wear a headset, you hold the joysticks in your hands, and you're trying to move around. It's just very hard. And the problem isn't just the quantity of data, although that's obviously a problem. You could spend a lifetime doing this and we think you wouldn't get to internet scale. Yeah. But the bigger issue is the diversity of data, right? If all the data you have is data that you've intentionally collected,
then you almost by definition haven't seen the corner cases. You haven't seen all those edge scenarios that cause failure. So the way that we're approaching it is different altogether. Our team comes from generative AI and computer vision. And the idea is, if you look at every other AI model that's worked, they all start with an incredible amount of data, typically a whole internet's worth of data, whether it's language models or image models or video models, and then there's a small amount of fine-tuning that you use to align the model. For that fine-tuning data set, teleoperation is fine, by the way. That's what we're doing. But for the pre-training, it's just completely inadequate. So what we did is say: what data set is out there that's internet scale, has massive diversity, and lets you learn about the physics of how things move? And there's only one answer, and that's internet video, right? Because our team comes from computer vision and generative modeling, that was the approach that we took. So we basically, literally trained the model on hundreds of millions of videos, really millions of clips in fact, and in our view the model has seen almost anything that you can see in reality. And then with a tiny amount of teleoperation data, literally on the order of 10 hours, compared to what the VLA approach requires, which is like tens of thousands if not hundreds of thousands of hours of data, you can actually teach the robot to do certain tasks. So that level of data efficiency is something we've never seen. That's one of the big breakthroughs here.
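To make that recipe concrete, here is a minimal sketch of the two-stage idea Singh describes: pre-train a video model on a huge, diverse corpus to learn a prior over how things move, then align it with a small teleoperation set. Everything here, the module names, the toy next-frame objective, the shapes, the random placeholder data, and the choice to freeze the prior, is an illustrative assumption, not Rhoda AI's actual architecture or training code.

import torch
import torch.nn as nn

FRAME_DIM = 64 * 64 * 3   # flattened 64x64 RGB frame (toy size)
ACTION_DIM = 7            # e.g. a 7-DoF arm command (assumed)

class VideoPrior(nn.Module):
    # Stand-in for the pre-trained video model: predicts the next frame from the current one.
    def __init__(self, hidden=512):
        super().__init__()
        self.backbone = nn.Sequential(
            nn.Linear(FRAME_DIM, hidden), nn.ReLU(),
            nn.Linear(hidden, hidden), nn.ReLU())
        self.next_frame = nn.Linear(hidden, FRAME_DIM)

    def forward(self, frame):
        return self.next_frame(self.backbone(frame))

class ActionHead(nn.Module):
    # Small head aligned with the scarce teleoperation data: features to actions.
    def __init__(self, hidden=512):
        super().__init__()
        self.head = nn.Linear(hidden, ACTION_DIM)

    def forward(self, features):
        return self.head(features)

prior = VideoPrior()
pretrain_opt = torch.optim.Adam(prior.parameters(), lr=1e-4)

# Stage 1: pre-train the physics prior on (placeholder) internet-scale video pairs.
for step in range(100):                          # real pre-training would run far longer
    frames = torch.rand(32, FRAME_DIM)           # placeholder for sampled video clips
    next_frames = torch.rand(32, FRAME_DIM)
    loss = nn.functional.mse_loss(prior(frames), next_frames)
    pretrain_opt.zero_grad(); loss.backward(); pretrain_opt.step()

# Stage 2: align with a tiny teleoperation set (frames paired with robot actions).
action_head = ActionHead()
align_opt = torch.optim.Adam(action_head.parameters(), lr=1e-4)
for step in range(100):
    frames = torch.rand(8, FRAME_DIM)            # small batches: teleop data is scarce
    actions = torch.rand(8, ACTION_DIM)
    with torch.no_grad():                        # keep the pre-trained prior frozen here
        features = prior.backbone(frames)
    loss = nn.functional.mse_loss(action_head(features), actions)
    align_opt.zero_grad(); loss.backward(); align_opt.step()

The point of the sketch is the asymmetry: the prior sees orders of magnitude more data than the action head, which is why, in the account above, roughly 10 hours of teleoperation can be enough to align the model to a task.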
So what is the early customer set going to look like? Can you work with any type of robotics manufacturer, everything from humanoids to arms, or is there a particular focus? What are you most excited about?
Yeah, great question. So we're not going after the consumer market. We think there's lower-hanging fruit in the commercial, industrial, and logistics markets, where you already have a lot of tasks being done that people are being paid for, where you could use some help from automation. And we actually are a full-stack company. So we have this AI model. It's very cool. We call it the direct video action model, because it makes a video of what it thinks the robot should do and then converts that to action, so the robot actually does it, and it does that in closed loop, so it re-observes. Basically, the way the model works is observe, predict, act, in closed loop. And so that's a key part of our value creation, but we're a full-stack company, so we're doing the robot hardware as well.
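For the model side, here is a minimal sketch of the observe, predict, act loop described above. The class names and methods (predict_video, video_to_actions, observe, execute) are hypothetical stand-ins meant only to show the closed-loop structure, not Rhoda AI's actual interfaces.

import time

class DirectVideoActionModel:
    # Toy stand-in: "imagines" a short video of the intended motion, then converts it to commands.
    def predict_video(self, observation):
        return [observation] * 8                   # placeholder: 8 imagined future frames

    def video_to_actions(self, video):
        return [[0.0] * 7 for _ in video]          # placeholder: one 7-DoF command per frame

class Robot:
    # Hypothetical robot interface with a camera and a joint-command API.
    def observe(self):
        return "frame"                             # placeholder camera frame

    def execute(self, action):
        pass                                       # would send one joint command to the arm

def control_loop(model, robot, hz=10, steps=20):
    # Observe, predict what should happen next, act, then re-observe (closed loop).
    for _ in range(steps):
        obs = robot.observe()
        video = model.predict_video(obs)           # imagine the next few moments as video
        actions = model.video_to_actions(video)    # convert the imagined video to commands
        robot.execute(actions[0])                  # act on only the first command,
        time.sleep(1.0 / hz)                       # then close the loop by re-observing

control_loop(DirectVideoActionModel(), Robot())

The detail that matters from the description is the re-observation after every command: prediction errors get corrected on the next cycle instead of accumulating open-loop.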
So why are we doing hardware, even though there are 100-plus robot companies in the world? Well, because we couldn't find one that actually met the requirements of the use cases we're interested in. We wanted to lift 25 kilograms, not just once, but all day long. That's called rated capacity, right? If you try to do that with conventional robots, you'll burn out the motors, and that's obviously a reliability issue. It has to be reliable enough to last the full three years that we're targeting, basically. But more importantly, we wanted the robot kinematics to have what's called a linear response. Systems with a linear response can actually be modeled by an AI model. If you have any kind of nonlinearity in a system, like compliance or elasticity, that's very hard to model, and that doesn't work well with AI. So we wanted to build more of an Apple-like solution, where we control the OS and we control the hardware, in order to provide a full solution to the customer. Having said that, some of our customers do want our model to control their existing hardware. Sure. You know, you opened up by talking about KUKA, for example. A lot of them have KUKA hardware. So it would be very straightforward to imagine a KUKA arm sending an API call into our model saying, here's what I see, what should I do? The model responds and says, do XYZ, the robot does that, and so on. So we will make the model available to third parties. We want to create a whole ecosystem here. But day one, we're really focusing on providing a full solution that is really tightly integrated.
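As a rough illustration of the "here's what I see, what should I do" pattern for third-party hardware, here is a hedged sketch of what such a call could look like from an existing arm's controller. The endpoint URL, payload fields, and response format are all hypothetical; no public Rhoda AI API is described in the conversation.

import base64
import requests

def request_next_action(image_bytes: bytes, task: str) -> list[float]:
    # Post the current observation and task to a hosted model (hypothetical endpoint).
    payload = {
        "task": task,                                            # e.g. "pick the tote off the conveyor"
        "image": base64.b64encode(image_bytes).decode("ascii"),  # current camera frame
    }
    resp = requests.post("https://api.example.com/v1/act", json=payload, timeout=1.0)
    resp.raise_for_status()
    return resp.json()["joint_command"]                          # hypothetical: one value per joint

# A controller for an existing cell (a KUKA arm, say) could call this in its own loop
# and forward each returned command to the arm's native motion interface.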
Yeah. So, great answer on the teleop question. Talk about the sim-to-real gap. Simulation is building a physics engine and a robot model that perfectly mirrors the human body, or whatever your hardware is, all the different hinges, then running that in a virtual environment, Unreal Engine or something like that, and then transferring that learning back to the robot. Why does that not work as well as people expected it to work?
So, first of all, excellent question. You guys are more up to speed on robotics than I would have guessed, so kudos for that one. Sim-to-real is exactly what you described: you learn in simulation and you try to apply that in the real world. There are two problems there. One is, no matter how well you try to model the laws of the physical world, there's always what's called a sim-to-real gap, right? You can't perfectly model all the complexities of things like deformable objects and transparent objects, how light moves, and so on. But more importantly than that, it's the same kind of problem that happens with teleoperation, which is a lack of diversity. Right? If all the data you're collecting is intentionally collected,
then you're not going to see all those corner cases. And that's the fundamental difference between the lab and the real world: the so-called long tail of the distribution, right? In the real world, you see a lot more of these long-tail events that you might only see once. Yeah,
you know, in your entire lifetime, but you need to be able to deal with them. A good example of that, by the way, was this case where one of the self-driving cars ran into a woman on a bicycle chasing a chicken onto the street. Right? Now, there's no way you're going to see that more than once in your data set. And even that you're not going to see in your intentionally collected data set,
but you might actually see it on internet video. Like, there is a video right now of Tony Hawk doing the 900. And that is a very weird thing for a humanoid body shape to do, but he did it. It's real, and the laws of physics apply. So there is something that you can learn from that, even if it's in the longest tail, that applies back to just picking up this Diet Coke and taking a sip.
Touché. And that's actually a really important point. So when we curate our data sets for the pre-training, we don't try to overly curate them down to, for example, manipulation tasks. We have things like videos of waves crashing on a beach.
Sure.
Which you might think has nothing to do with manipulation, but it turns out there's some knowledge about physics and how the world works in those videos. And that's why this model can generalize so powerfully: because it's been trained on so much stuff. In the same way that ChatGPT can produce Shakespearean lyrics if you want, or plays, or whatever, but it wasn't only trained on Shakespeare; it was trained on rap lyrics and Twitter feeds and so on. The same thing applies here: you want to be able to train on everything to provide that prior on how things move,
and then you align that prior with robot-specific data from teleoperation to perform the task.
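A small sketch of that curation idea, assuming a made-up category mix: keep the pre-training corpus broad (waves, sports, everyday scenes) rather than filtering it down to manipulation clips, and reserve the small, robot-specific teleoperation set for the alignment stage. The category names and weights are placeholders, not Rhoda AI's actual data mix.

import random

PRETRAIN_MIX = {
    "manipulation": 0.30,       # hands and tools interacting with objects
    "sports_and_motion": 0.25,  # whole-body dynamics: skateboarding, gymnastics, ...
    "nature": 0.15,             # waves, falling objects, fluids: generic physics
    "everyday_scenes": 0.30,    # kitchens, warehouses, streets
}

def sample_pretraining_clip(corpus_by_category):
    # Sample across all categories so the prior sees more than manipulation alone.
    categories = list(PRETRAIN_MIX)
    weights = [PRETRAIN_MIX[c] for c in categories]
    category = random.choices(categories, weights=weights, k=1)[0]
    return random.choice(corpus_by_category[category])

# Alignment stage: a comparatively tiny, robot-specific teleoperation set.
teleop_alignment_set = ["teleop_episode_001.mp4"]   # on the order of hours, not internet scale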
That makes a ton of sense. Well, congratulations everyone.
How did the round come together? Who participated?
Yeah. Yeah. Who participated?
Yeah, it's a great round. So, we're announcing a Series A. It was led by, you know, Coastal Ventures, led by Tomas, led by Frank.
We have a gong here.
Congratulations. $450 million Series A. That's a great way to come out of stealth. I always recommend to every founder: if you're going to come out of stealth, make a splash.
Yeah.
And project confidence.
We appreciate you.
Have a great rest of your day.
Talk more.
Thank you.
Let me tell you about Plaid. Plaid powers the apps you use to spend, borrow, and invest, securely connecting bank accounts to move money, fight fraud, and improve lending, now with AI. And let me also tell you about Shopify.