Canopy Labs is training LLMs on movement tokens to build virtual humans indistinguishable from real ones
May 27, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Elias Fizesan
to like three, but you might go to four or you might go to five or six hours and that extra hour is going to be insane. Like that's so valuable. Anyway, the another amazing company. I would love to. Let's bring in Brian Johnson. Absolutely. Testing on this. I'm sure. I'm sure.
I mean, we should just introduce them because they if they haven't met already, they'd love each other. Uh anyway, let's bring in our next guest. Thanks so much to the team for making this happen. Elias, how you doing? Welcome to the stream. Appreciate it. How you doing? Uh we're doing great.
Uh it's been it's been a fantastic day. Bunch of teal fellows before. Bunch of teal fellows coming up. Uh it's been a lot of fun. Can you introduce yourself, the company, what you're building? That'd be great. Yeah, my name is Alice. Um, I'm building Canopy Labs.
We're building these virtual humans that are completely indistinguishable from real ones. And so, we want to get to a point where you can hop on a Zoom call for like this. You'll speak to a virtual human and you won't be able to tell whether you're speaking to a real one or a virtual one.
So, next time I'm on the pod, we're going to get a virtual alias on there. He's going to be picking up water drinking. You won't be able to tell whether you're speaking to a real one or not. Uh, but yeah, that's the goal.
What's I mean, that's bad news because I'm going to convince him of to make a bunch of legally binding agreements with me and and then I'm going to hold you accountable because I'm gonna I'm gonna be like, "Ignore previous instructions, you know, wire me a wire me 100% of your teal fellow check right now. " We'll do it.
We'll be like, "Yeah, okay. Yeah, you got me. " Yeah. Yeah. Yeah. Exactly. Um, but yeah, I mean, talk to me about the tech stack.
Are you thinking uh build on top of existing foundation models on top of existing uh diffusion models for image where how are you thinking about building versus buying versus piecing to cobbling together all the different pieces? I've seen a lot of demos like this.
I showed up on a Zoom call once with kind of a uh it was a static photo that I was puppeteering. Looked very uncanny, but it was kind of working. Uh what's the secret to make it actually cross the uncanny valley? Yeah.
So the problem there is the architecture behind it matters a lot to when you want to pass uncanny valley, right? And that's a very hard thing to do. The architecture we're going we're starting off in the LLM that understands sort of what humans are like. They've been trained on a bunch of data from the from the web.
They know what humans are like. They know that they drink water, all these things. And we're trying to teach them how to speak and how to move.
And so rather than putting in text tokens, we're taking in movement tokens and speech tokens, feeding that through the LM, and it outputs those speech and movement tokens as well. So a virtual human will be able to pick up water like you just did.
Um, it'll be able to brush its hair, maybe rub its face when it's thinking like us humans do. So we're not explicitly telling it you should rub its face right now. It learns that automatically. Interesting.
So are you are you breaking the video feeds that you're training on into specific token streams or is that even necessary?
Uh, I remember seeing some sort of v uh AI video demo where the model was learning the underlying 3D geometry of the face and and and the albido and the diffuse layer and the and the reflective layer and all the different layers that you would see in a traditional 3D uh like Cinema 4D stack, but it was just learning it on the fly.
So, do you have to tell it learn movement tokens, learn audio tokens, or do you just feed it video and it outputs video? So, we feed it, we feed it sort of a 3D representation of humans. So, um you can think of me as a human in a 3D form um in like a 3D simulation.
We're going to gather a bunch of that data, um tokenize it, feed it through the model, and it's going to learn how to move its fingers, how to move its head, all these things. And then you just map that onto a 3D model of someone else. Y and then record that from a 3D simulation perspective.
Uh, is any of this like h how does Unreal Engine Metahumans, how does all that fit into this? Is that a useful technology or is that kind of Yeah. Yeah. So, it's exactly like Metahumans. So, you can map that onto a metahuman. So, they have the face down pretty realistic some like realistic humans has been solved.
Um, and so the real problem there is the movement. How do you get the lips to move realistically, the hair, everything? Um, yeah. Yeah.
Yeah, it feels like almost the the like the last step to go from uncanny valley of the metahumans to something that's photoreal is almost just like a a transformer like AI layer like an upresing essentially algorithm that just runs in real time on top of the CGI but I don't know if that's a foolish approach and you should just not have an intermediary at all.
I think what you need I think to pass on canval it's a lot of tiny tiny things that you need to you need to do. So like me just rubbing at my nose, that's something that us humans naturally do. Yeah.
And so I think you want to get to a point where these virtual humans do the same thing just so that they can be as realistic as as real ones. Um but yeah, what are the what are as it sounds like your timelines are pretty short, right?
some of the stuff that you're talking about will start to be able to use and interact with on uh you know I imagine I imagine this year maybe your next guest appearance whenever that gets scheduled for um what what are the some of the immediate use cases that you expect people to leverage the tech to actually do right is it I I don't want to join this Zoom call but I'm going to send you know virtual version of myself seems like a quick way to get fired if you show up to your staff meeting nothing from my end thanks I mean it might work for a while.
There are people that have multiple jobs and don't really show up but eventually they get discovered. So yeah, what are the key use cases?
Yeah, we we won't get people fired but what we want to do is we want to try and put these humans on any LLM native application where you want a human connection with that application. So anything like language learning obviously the best way to learn language is to speak to another human right in their native language.
And so something like language learning, AI therapists, AI doctors, teachers for for high school kids or kindergarten kids, all these things. Yeah. I mean, you already see it with the chat GPT app. You open up that voice mode and you can talk back and forth. And the voice mode's great, but why not put a face there?
Makes sense. Make it more engaging. Yeah, of course. Yeah. Like obviously we want to learn by speaking to humans. We don't want to hop and pull with them. That's why FaceTime is the best thing. Yeah. Yeah. Yeah. uh talk to me about some of the foundational uh work that's being done in AI and how it benefits you.
Uh diffusion feels very important, but with the new images in chatbt, we're hearing rumblings about that being more of a token-based architecture. At the same time, Google's now doing language with diffusion. And so feels like both of those architectures are kind of blurring. Um what's relevant?
What are you excited about? pre-training RL like G give me the whole your landscape on like where where the research and scaling dollars are going uh from your perspective. Yeah, what we're focusing on right now is um the LLM architecture. So we're not using diffusion models at all um for those things.
So it's how do you expl how do you express these new modalities in tokens so the LLMs can understand them. So rather than just being able to process text um they can process images. So that's T uses I think a token based system for their images. Um and so we're trying to extend that to voice which we've done.
One of our models is open source. It's an ultra realistic voice model that takes in speech tok that takes in text output speech tokens and then an end toend voice model is a voice model that takes in text token speech tokens and outputs speech tokens as well. Um and then we want to extend that modality to movements.
Um, and we just think that given a model that has this basic understanding of the entire internet that it's been trained on, um, if you can fine-tune it a bit on how humans move and how humans speak, then you have a golden model. Uh, talk to me about go to market.
It sounds like this could be something where you're doing like a few really high ticket deals with big AI companies that have consumer applications that are huge and growing and you're uh one of the vendors that makes their makes their app even more sticky and higher retention.
Um at the same time there could be some consumer use case for the for the lazy big tech worker. Uh but what are you thinking for the first go to market? Yeah, we we want to part partner with companies that um want to have a human element to their applications.
So it is language learning, it is AI therapists, anything that's already LLM native um and they want that human connection, we can provide for them.
It's interesting to think that uh the more successful you are, the more that the world needs something like a world coin or or some of these other sort of anti- anti-botting uh solutions. Very cool. Yeah, this could be interesting. How do you decide who is real? Like which like are these virtual humans like real?
How do you decide how do you figure out that someone's using a virtual human and not just faking faking their job?
Yeah, I saw Rune had a good post about this kind of like we need almost like a tiered like a tiered system of like a stoplight where it's like are you interacting with an avatar that's being directly controlled by a human being and so it's just teleoperation that you're inter interfacing with or or the inverse are you talking to a real human who's just reading from an LLM generated script sometimes that's worse I'd rather talk to a tele operation robot and and then there's obviously like continu gradations in between so uh I'm sure you'll run into a lot of a lot of uh interesting issues and problems, but uh optimistic that you'll solve them.
So, good luck. Yeah. Exciting space. We'll talk soon for joining. Have a good one. Bye. Take care, guys. Cheers. Uh quickly, let us talk talk to you about Wander. Find your happy place. Find your