Amit Jain on Luma AI's multimodal models serving tens of millions and transforming Hollywood and advertising
Jul 2, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Amit Jain
to come on the show. And you're welcome to send one of your alien friends over to the show to do the interview for you. Send him over. Send him on. Anyway, we have our next guest, from Luma AI, coming on the stream. How are you doing? Good to meet you. Doing very well.
Great to meet you as well, and thanks for having me. Of course. Welcome to the stream. Why don't you kick us off with an introduction on yourself, the company, any breaking news for us? Awesome. So my name is Amit. I'm one of the co-founders and CEO of Luma.
At Luma, we're building what we consider to be the future of multimodal intelligence. That includes what comes after LLMs: building multimodal models that learn from audio, video, language, and image all together.
Our first product is video generation: models that are able to generate video and audio all together.
And currently we're working with Hollywood, we're working with advertising agencies, and we're also working with a lot of individual creators. We have multiple tens of millions of users; we just don't talk about the exact number. Sure. Sure. Sure.
Explain multimodal models to me. Are you trying to tokenize everything into some sort of standard format, so that an image, a video, code, text all appear as tokens in the same kind of stream? Or are you creating a system that can kind of shift gears between one and the next?
That's a great question, and this is basically the core differentiation. Right now, most people are doing the latter. Most of the AI labs that have large language models are using those as backbones and trying to teach them how to read image and video as tokens.

But the problem with that approach is that the output is very, very subpar. Even on the most basic tasks, second-generation deep learning models, your convolutional networks, outperform them on most VLM tasks. The first approach is the one where you create a joint latent space, where all of this information lives in the same representation, and then when you learn on it, it's quite analogous to how a human brain learns, right?
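The "joint latent space" idea can be sketched in a few lines: encode each modality into the same embedding dimension so a single backbone can attend over one interleaved token stream. Everything below (dimensions, random projections, token counts) is an illustrative toy, not Luma's actual architecture:

```python
import numpy as np

rng = np.random.default_rng(0)
D = 64  # shared latent dimension (hypothetical)

# Per-modality encoders: here just random linear projections for illustration.
# Real encoders would be learned networks (patch embedders, audio encoders, etc.).
W_text  = rng.standard_normal((300, D))   # e.g. word-embedding dim -> D
W_image = rng.standard_normal((768, D))   # e.g. patch-embedding dim -> D
W_audio = rng.standard_normal((128, D))   # e.g. mel-spectrogram dim -> D

def encode(features, W):
    """Project modality-specific features into the shared latent space."""
    return features @ W

text  = encode(rng.standard_normal((12, 300)), W_text)    # 12 text tokens
image = encode(rng.standard_normal((49, 768)), W_image)   # 7x7 image patches
audio = encode(rng.standard_normal((30, 128)), W_audio)   # 30 audio frames

# One interleaved sequence: a single model can now learn across modalities,
# rather than bolting a vision adapter onto a text-only backbone.
sequence = np.concatenate([text, image, audio], axis=0)
print(sequence.shape)  # (91, 64)
```

The point of the sketch is only structural: once every signal lives in the same representation, "audio, video, image" really are interchangeable token streams to the model.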
Audio, video, image, they're not really terribly different for us. These are all just signals. And when we think of an event, let's say dropping a glass, we can hear it in our head, we can see it in our head, and we can also predict: oh, if it's a piece of glass, it might break.
So this disparate approach of training a language model and a video model and an image model separately is going to get old very quickly, because on the language model side we're kind of running out of data. As you're seeing with GPT-5, or the previous thing that was called GPT-5: ten times more compute, a ten times larger model, but the performance isn't much better than GPT-4, so it was retroactively renamed to 4.5. On the image and video model side, even the models we make today are kind of dumb. They're basically translators. They take text, a prompt, and translate it into video, which is interesting, but that doesn't solve someone's problem.
And the problem being: hey, explain to me how something works, or tell me a story, or here's the script, make me the whole thing. That needs a level of reasoning and understanding that these models don't have.
So when you combine them all together, one, you solve the data problem, because there's an immense amount of image, audio, and video data in the world, and two, you're able to build intelligence that goes beyond what LLMs are able to do today. Talk to me about images in ChatGPT.
Do you think that there are multiple layers going on there? I was shocked by the quality of text, and then the quality of the Studio Ghibli moment. The style... it feels like style transfer on steroids.
Yeah, but I was particularly surprised they could do both, and it felt almost like two different layers running. I was trying to do some tests where I would generate some text and then have a snake weave through in front of and behind the text, and it was kind of struggling with that, which made me feel like maybe there was a second pass here.
Is that a reasonable theory?
And also, I'm not saying that as a knock on anything, because if I went to a human and asked how they would design a birthday card with, you know, a dinosaur in the background and then happy birthday text on top, they would draw the dinosaur and then they would use typesetting, or they would draw the text on top.
And so I don't necessarily think that's the wrong path. I'm not knocking the strategy. I'm just curious about what is actually going on in that model, because it was remarkable. But it raised a bunch of questions for me. Absolutely.
So, to the best of our understanding, they're doing autoregressive image generation as the first pass. That produces a blurry, low-resolution image, or a somewhat high-resolution image that's missing the high-frequency details. Then you do a diffusion pass on it to get to a decent-quality image representation.
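A toy sketch of the two-pass pipeline being described: an autoregressive stage fixes the coarse layout, then a diffusion-style refinement pass adds high-frequency detail. The "models" here are stand-ins (random numbers and nearest-neighbour upsampling); only the structure of the pipeline is the point:

```python
import numpy as np

rng = np.random.default_rng(1)

def autoregressive_base(h=8, w=8):
    """Stage 1 (toy): produce a coarse, low-resolution image.
    A real model would emit image tokens sequentially with a transformer."""
    return rng.random((h, w))

def upsample(img, factor=4):
    """Nearest-neighbour upsample: blurry, missing high-frequency detail."""
    return img.repeat(factor, axis=0).repeat(factor, axis=1)

def diffusion_refine(img, steps=10):
    """Stage 2 (toy): iteratively perturb the image on a decreasing schedule.
    A real diffusion pass would denoise, conditioned on the coarse image."""
    for t in range(steps):
        noise = rng.standard_normal(img.shape) * 0.05
        img = np.clip(img + noise * (1 - t / steps), 0.0, 1.0)
    return img

coarse = autoregressive_base()                 # 8x8 semantic layout
refined = diffusion_refine(upsample(coarse))   # 32x32 with detail added
print(coarse.shape, refined.shape)             # (8, 8) (32, 32)
```

The design rationale matches the conversation: the autoregressive pass carries the semantics (what goes where), and the diffusion pass is only responsible for making it look sharp.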
Now, text is one thing, but I believe text rendering in image models, especially the latest image models and what we're seeing in our next Photon 2 model, is actually getting very good. So that is not the problem. You know what's really interesting, though? Structure.
So when you do image rendering with just a diffusion model, Stable Diffusion or name your pick... Yeah.
You rely on the text encoder of the model to tell you about the semantics of what the person is asking for, and then you rely on the diffusion model to do all the intelligence. Let's say you're asking it, hey, what is 2 + 2. If there's nothing in front of it, like an LLM that is able to say, oh, the user is asking 2 + 2, so let me draw equals four, then these models have ridiculously zero understanding of whatever you're trying to say. They will just render 2 + 2 equals and stop.
Right? That's what I mean by zero understanding. But when you have an LLM generating it, or a thing that has a good understanding of language, and it's unified...
So they call it 4o, omni, right? Then in language space, within the model, you can reason about it: oh, 2 + 2 should be four, let me render that. Now, to my belief, GPT-4o is still a system rather than just a model. Sure. Right.
So when you type something, it still goes through GPT. Right. They do all the prompt enhancement and reasoning about it in language space and then give it to the model, and it does a very good job of rendering it.
But when you ask for complex layouts, then I think that strength starts to show. Mhm. But of course, you gave the example of the snake coming in and out. So this is again a question of prompt alignment, and while it's good, it's not perfect yet.
Do you have any reaction to the Wall Street Journal today? There's a piece: Hollywood wants AI protection. There's a lot of back and forth there. I mean, there have been a number of interesting moments in AI imagery and AI video.
Obviously, the general narrative is: the technologists say this is amazing, we've created this amazing algorithm, and I agree with that. And then the backlash is: we don't want all the photographers to be put out of business.
But the more interesting stories that have emerged are like Veo 3: Google seems to be able to generate Disney IP with perfect accuracy, no problems. And maybe that's because they have some deal with YouTube.
And then when the Ghibli moment happened, I was like, wait: image generation in ChatGPT would reject different things for safety reasons, but also for intellectual property reasons, but not Studio Ghibli, which is a style, but it's also intellectual property, and it's not exactly the cleanest thing.
So maybe there's a deal behind the scenes.
So how do you think the interplay between the startups and the technologists and legacy media, Hollywood, Washington DC, is evolving? What stories are you tracking? Right, so there are two axes you're talking about. One is basically adoption and job losses and things like that, that's one narrative, and the second is the IP itself and how you think about that.
So on the first one, it's actually at this point a foregone conclusion that unless Hollywood and the traditional way of producing media change, they're on a path to extinction.
There are really no two ways about it, and of course there are arguments like, oh, but it can't do this, it can't do that.
The thing is, we spent last year, 2024, building out the infrastructure for training these really large, really capable models. It trains on petabytes and petabytes of data, and it's much more complex than LLMs. Of course LLMs are complicated to train, but you still have comparatively little data; the amount of engineering required to train multimodal models is insane. Anyway, now that the infrastructure exists, the rate of progress is very, very fast. We saw the same thing in language: initially everyone had to build from zero to one, but once you have zero to one, turning out models takes a quarter, sometimes less.

So anything they think it can't do today: it can't do 4K? We already do 4K. It's not able to control exactly the pose I want? We launched this thing called Modify Video, where you can give it your camera feed as a prompt and control exactly any camera movement, any pose, anything you want.

So we work with some of our great partners in Hollywood, and some of them are very forward, some of them are very much not there today. But when you think about what a studio is, a studio is two parts: a financial institution that is responsible for greenlighting things and managing, okay, this is the area we go into and this is the area we don't, and a production part. The production part largely works by contracting other, smaller production houses, but generally that's the structure. AI, especially generative media, changes the economics so wildly, from $100,000 a minute, sometimes a million dollars a minute, all the way down to $10 a minute, $100 a minute.
When that is the level of economic change and you follow the money, there's just no scenario in which you don't come out thinking, okay, the old way of doing it is not going to work. So, yeah.
And an interesting thing I've been thinking about: the cost of production drops by an order of magnitude if maybe you're shooting some scenes with regular actors and cameras and union crews, but then you're able to generate certain scenes or moments.
The internet has this insatiable demand for content, and it feels like, when you look at some parts of the entertainment production stack, great writers might be more in demand than ever, because they'll have to write more; stories will be more in demand. You can imagine a world where HBO is putting out a potentially iconic new property or franchise and shipping them monthly, right?
When historically maybe they only had one or two massive releases a year, like a Game of Thrones. And so I actually think the consumer is going to win, along with the people within the industry who adapt and say, "I'm going to lean into using this tool so I can write better. I can produce." Or, and I'm curious about this, even the idea that certain scripts are now being encouraged: bring your script with kind of an idea of what this world looks and feels like, so that it's not just a static, text-based concept, right?
Show me, give me the pilot, basically. Exactly. And it's top of mind because today we released an ad that we handmade, farm-to-table content, for one of our partners, Wander, and there are a few scenes in there that you probably could have AI-generated. We didn't, but they cost us real money, scenes we could have potentially stripped out, and it feels like we're so close.
Yeah. So here's the over-under on that, actually. The demand for video is absolutely absurd. You're right. The average person is watching about three and a half hours of video on their phone every day. I don't know if that's good, but it's a big number.
We're certainly, you know, we might be contributing to that. But it's true. And the thinking about that is very simple: people don't like to read.
If they could watch an explanation, watch how to do something, watch a piece of news, whatever have you, they would do that over reading it or reading about it, right? Like all day, all night.
So that means, for entertainment, yes: generally with entertainment we think, okay, you make it once and then millions of people watch it again and again, all these kinds of things.
But when you think about news, and question and answer, and explanations, all these kinds of things, that's ridiculous, right? You can't generate a video for every explanation, for every scenario, on every planet. It has to be done dynamically.
It has to be done automatically, and it has to be done with AI. Now, the number I track is: for every person on the planet, out of those 3.5 hours they watch, if we can generate at least an hour of video a day, that is what the success metric looks like for us.
So that looks like about 6 billion hours of video generated per day. One, there's no human capacity that can actually produce that kind of volume, right? Editors can't do that. And two, there's no compute in the universe at the moment able to actually do that.
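The arithmetic behind that 6 billion figure is worth a quick check: roughly one generated hour per person per day, against the 3.5 hours watched that he cites earlier. The population count here is an assumption inferred from the quoted total:

```python
# Sanity check of the numbers in the transcript: ~3.5 hours of video watched
# per person per day, with a target of 1 generated hour per person per day.
# The population figure is an assumption implied by the quoted 6B hours.
people = 6_000_000_000
watched = 3.5 * people    # total hours of video watched per day
generated = 1.0 * people  # target: AI-generated hours per day
share = generated / watched

print(f"{generated:,.0f} generated hours/day, {share:.0%} of viewing")
# 6,000,000,000 generated hours/day, 29% of viewing
```

So the stated goal amounts to AI generating a bit under a third of all video watched, which makes the "no compute in the universe" remark concrete.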
But that is what to track towards, because video is now the substrate of information on the internet. There are two things happening to the internet. One, it's becoming zero-click: you don't hunt around in websites trying, oh, this answer, this answer.
No, you just ask an LLM, or now Google also has AI Mode; if you're in the Labs experiment, you just get the damn answer. And the second thing that is happening is the substrate of information is changing from text to video. This is what people do.
So if you just extrapolate from this trend, video production has to change, entertainment has to change, and for the people who don't come along, it's going to be a very hard journey.

On advertising: we could talk about advertising for about an hour, and Cannes was very illuminating for me about how people are thinking about it. Yeah, I want to get the full update from Cannes. Is this your first time going? Break it down for us, this is the Cannes festival. I mean, one of the exciting things about what you're doing: when you look at the cost structures of consumer brands, which we both have exposure to, oftentimes consumer brands are held back because they don't have net new ad creative. And so not only is it a cost, it just takes a long time to produce, and so you have these gaps where a brand will see lower performance, slowing growth, et cetera.
And it's really just because they don't have access to net new ad creative that can drive new demand. So that's an entire area that I don't think is being fully priced in yet: how the great brands will actually be able to grow faster just by being able to generate net new ad creative day by day.
Yeah. Give us the full Cannes recap. Sorry, answer that question and then I want the Cannes recap. Yeah, I think the answer to that question and Cannes are almost the same. So yes, it was my first time. Beautiful place, to be honest with you.
But really interesting: it's not an AI event, it's not an AI conference, but literally every conversation was about AI. Every conversation. What's really interesting is that maybe 1% of the people knew what the hell they were talking about.
Nobody knows, but everybody's excited. Yes. But I got blisters in my ears hearing the word agent. All these agency people pitching, right? Like, oh, we have AI agents to do this, and we have AI agents, we're building this to do that.
In fact, Adobe was there, and their salespeople were trying to pitch non-existent AI agents to people and trying to lock them into three-year contracts. The level of desperation was palpable. Shots fired. Shots fired. But it's insane.
And when you go and ask, "Okay, what do you mean?", right? We talked to everyone, and what is it? It's a prompt box where you can select a bunch of models. That's it. That is not an AI agent; that doesn't do anything. Senator, that's SaaS. That is SaaS. Exactly.
And the issue is that a lot of people are thinking, okay, well, developing this is basically nothing, we'll just produce the models and it's going to happen.
Developing models, developing agents in the creative space, in the visual space, is very difficult work. You need new kinds of models, the kinds we are training. Because think about it: Gemini is the best VLM out there right now, and it is barely able to meet about 3% of our benchmark of visual iteration. Let's say you're talking about brands; we met quite a few, including partners of ours like Coca-Cola. If you want to make an ad, and let's not even think about a Super Bowl ad, if you just want to make creatives, say 100 ads for different markets...
Getting them to be right, getting them to be on brand, getting them to look exactly how they should look, that's a hard job. Current image and video models, as I was telling you, are dumb. They will just make whatever. They have no idea.
And then when you use VLMs to critique them, the outputs are barely any good.
So you need a new kind of model, the next thing after LLMs: models that are able to understand visual information, instructions, and what makes something engaging. That level of understanding is what you need. LLMs are getting there; for stories, when you write text, LLMs are now making things that are very engaging. Now we need that next jump in models. So that was my overall impression of Cannes. The second thing I felt: by next year, this place is going to look very, very different.
Fortunes are going to change significantly, because when you talk to some of the most senior folks in various agencies, the writing is on the wall for them. They employ hundreds of thousands of artists and people, and they're asking us, oh, how do we actually do this more efficiently? And the thing is, that's the incorrect way of thinking about it.
So, long story short, it was not an AI conference, but it ended up being an AI conference where nobody knew what the hell the AI actually is, and consequently nobody knew what is coming. Mhm. What areas do you think are safe from disruption?
We talked to a photographer who does stuff in CPG, and he was saying there could actually be a situation where, if you're McDonald's, you can't generate AI images of your burger, because there are already FTC rules around having to use the real burger. There were all these false-advertising lawsuits where they'd show the perfect cinematography burger, perfectly made, looked amazing, and then the expectation-versus-reality memes went viral.
The flat, disgusting burger bubbled up, and eventually I guess there was a lawsuit. And even if you're selling a phone, you have to say "screen images simulated," or if you're driving a truck, you have to say "closed course, professional driver," for safety reasons, obviously, or "drink responsibly" if it's alcohol. So the FTC has a bunch of rules. Do we think there'll be anything there on the AI side? I think those rules will change too, basically. So take the burger example, right?
The problem is you are overrepresenting the product you're selling. Yes. Right. It is not a problem of the tooling you used to make it with. You could have made it how the thing actually is.
I mean, that would look very disappointing in the ads, but you could have made it closer to how the thing actually is, so that people aren't misled into thinking, oh, I'm going to get this big, giant, fluffy thing, and then when you get it: oof, this is the thing, right?
So it's really not about the tool you used. It's about the intention you had behind it. You clearly wanted to mislead. So I think the FTC rule is right. But the rule should not be, oh, this should not be created with CGI or with AI or whatever have you. The rule should be: don't misrepresent your product.
And for me, the singular thing to think about is: follow the money. Right?
Actually, in advertising, the point you were making is really salient: advertising is all about stats, about what performs well and what doesn't. It's very hard to predict, so the question is how quickly you can replace the thing that isn't performing well, moving along the gradient of the things that are performing well, toward the best output. Today that cannot be done automatically at all in any visual form of advertising. For text advertising, it has been algorithmic for a very long time now, and that results in very, very good outputs.
So when companies are faced with the idea, okay, well, on our other products where we can do this, it performs so well, and on these products, no, because of the regulations, they will lobby for the regulations to change. And I think the regulation should be "don't misrepresent your product," not "don't use AI."
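The "algorithmic" optimization he says text advertising already has can be pictured as a multi-armed bandit over creatives: spend automatically shifts toward whichever variant's click-through rate looks best so far. A minimal Thompson-sampling sketch, with invented click-through rates; nothing here reflects any real ad platform:

```python
import random

# Toy Thompson sampling over ad creatives. True CTRs are invented for
# illustration; in production they are unknown and only estimated from clicks.
true_ctr = {"creative_a": 0.01, "creative_b": 0.10, "creative_c": 0.03}
wins = {k: 1 for k in true_ctr}    # Beta prior: alpha = 1
losses = {k: 1 for k in true_ctr}  # Beta prior: beta = 1

random.seed(0)
shown = {k: 0 for k in true_ctr}
for _ in range(5000):
    # Sample a plausible CTR for each creative, show the best draw.
    pick = max(true_ctr, key=lambda k: random.betavariate(wins[k], losses[k]))
    shown[pick] += 1
    if random.random() < true_ctr[pick]:
        wins[pick] += 1
    else:
        losses[pick] += 1

best = max(shown, key=shown.get)
print(best, shown)  # the highest-CTR creative ends up shown most often
```

This is exactly the loop that is easy with text ads (variants are cheap to produce) and blocked for visual ads today: the bandit only helps if you can keep feeding it net new creatives to test.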
Yeah. Last question for me: what about more narrow AI use cases in Hollywood?
I'm interested in the VFX pipeline. One second, sorry, I'm scheduling something, we're doing it live; we're going to clip this, so it doesn't matter. My question is about more narrow VFX use cases. Is there a single company that's just doing amazing green-screen rotoscoping? I've seen Runway do that in kind of the prosumer arena, but there are so many different pieces of the pipeline that I think artists would be much less jostled by. If it's just, hey, I have to track a camera all the time in 3D...
You know, there are services that do this using traditional CPU-based pipelines and solvers. There are IMUs that measure the movement of the camera. No one really cares about that particular role; it's kind of just a hassle they have to deal with. Right.
And so all sorts of more minor, vertical AI use cases. Those feel like point solutions that could be deployed right now. People could go and get a foothold, maybe expand to prompt-to-movie, but in the meantime, how about we just speed up the rotoscoping? Yeah.
I mean, why not, right? So, for instance, we released this thing called Reframe. What it does is: you give it a video in any aspect ratio, and it's able to generate any other aspect ratio out of it without cropping.
So if you give it a 9:16 vertical video and you want to share it on YouTube, for instance, where vertical videos don't perform well, and you want something horizontal, ideally 16:9, you can generate any of these variants, any number of them. And it's a point solution: it doesn't matter how the original video was made, you could have created that clip using any traditional method. So here's something interesting, right?
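The geometry of that kind of reframing is simple to sketch: keep the source frame intact and compute how much new content has to be generated on each side to hit the target aspect ratio. This is only the bookkeeping; in Reframe itself a generative model fills in the new regions, and the function below is a hypothetical illustration:

```python
def outpaint_margins(w, h, target_ar):
    """Given a source frame (w x h) and a target aspect ratio, return how many
    pixels of new content to generate on each side so the original frame is
    fully preserved (no cropping). Toy geometry only."""
    source_ar = w / h
    if target_ar > source_ar:
        # Wider target: keep height, extend width.
        new_w = round(h * target_ar)
        pad = new_w - w
        return {"left": pad // 2, "right": pad - pad // 2, "top": 0, "bottom": 0}
    # Taller (or equal) target: keep width, extend height.
    new_h = round(w / target_ar)
    pad = new_h - h
    return {"left": 0, "right": 0, "top": pad // 2, "bottom": pad - pad // 2}

# A 1080x1920 vertical (9:16) clip reframed to horizontal 16:9:
print(outpaint_margins(1080, 1920, 16 / 9))
# {'left': 1166, 'right': 1167, 'top': 0, 'bottom': 0}
```

The striking part is the ratio: going from 9:16 to 16:9 without cropping means generating more than twice as many new pixels per frame as the original contained, which is why this is a generative-model feature rather than a resize filter.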
This is not what you asked me, but I promise I'll lead to the answer you're looking for. At least in my life, since I've been conscious of it, there have been two really big changes before AI. One was mobile, and mobile was a very large consumer-side change, right?
People changed their behaviors drastically. Not overnight, obviously, but it looked like overnight, to be honest with you.
And then the second one happened a few years later, which was cloud. Cloud didn't really affect consumers too much, maybe enterprises, yeah, but it affected companies. Companies were redone inside out: things they used to do on paper, things they used to do on-prem. So these were two very radical transformations, but they happened to different segments of the world. AI is, for the first time, at least in my memory, a transformation that is happening at the consumer and enterprise levels at once.
It's basically rewiring companies from the inside, and it's changing consumer behavior like nothing else before. Right? So, coming back to your point: you can take the slow route. You can say, "Okay, well, let's do rotoscoping." But the tiny studio down the street that used to probably just create frames for you up until last year is now going to start releasing stuff that looks kind of as good as yours, and they can do it at a fast clip. They're releasing weekly episodes and iterating on them, and how long will it take before their IP is actually bigger than yours?
Mhm. Yeah, it's a good question. We have seen this happen on YouTube again and again and again. CocoMelon, right? It started from nothing, and look at it now, bigger than a lot of Disney IPs. And I think Disney bought them, or Disney tried to buy them. Something happened there.
So this is basically the difference. You can do the small things, and there's nothing wrong with that. You should do the small things.
But if you are in the business of producing content, if you are in the business of thinking about advertising, if you're in the business of touching pixels, the world is not the same as it was two years ago. And not in a small way; in a very, very deep way.
It might not solve the whole thing for you at the moment, but you have to start really thinking about what the economics will look like a year from now, two years from now, five years from now, and what the business should be, exactly. It's great. Well, this has been great.
We would love to have you back on; there are so many different topics we can cover. And I appreciate how freely you're willing to speak on so many different topics. That's the only way. That's the only way. You just have to be yourself. Awesome. Well, thank you for