Hugging Face CEO Clément Delangue on hitting 10M developers, open-source AI's strategic importance, and why video datasets are the fastest-growing category
Jun 24, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Clément Delangue
artificial intelligence, open source, everything that's going on. What is the origin of the name Hugging Face? It's the emoji, I believe. Yes. What is going on? How are you doing? Welcome. Hey, it's the emoji, like that. The hugging face emoji. Yeah. But come on.
You guys are giving me [ __ ] about the name, with a name like that? TBPN is a much better name than Hugging Face. I'm not giving you [ __ ]. I just read it and then immediately see the icon, obviously. But there's more to the name than that.
It's an emoji, but it came about because the product you were building before this was related to that, right? Break that story down. Yeah. When we started the company eight years ago, we were building a sort of Tamagotchi AI, an AI girlfriend, a ChatGPT before ChatGPT. There you go.
So, very much an entertainment product. And so we picked the hugging face emoji because that's the one we were using the most. Yep. We also had this joke that we wanted to be the first company to go public with an emoji instead of a three-letter ticker, you know, on the NASDAQ.
Yep. Still fingers crossed for that. I hope nobody's going to go public with an emoji before us, and that they'll wait for us. Yeah. And then the community just loved it and started to put it everywhere: on their clothes, on social networks, everywhere. So we decided to keep it. Yeah.
Owning an emoji as a brand online, I think, is still an underrated strategy. We see it every once in a while. I've seen people use the different fruit emojis for different things, and there was a whole strawberry story tied to OpenAI and such.
But when you can condense down, I mean, it's a coinage essentially. When you can own an emoji and have it say something, that's very powerful from a branding perspective. Yeah. You can even search with an emoji on Google and the App Store directly with the hugging face emoji. Yeah.
I mean, you're going to compete with the people that use the emoji for any other reason. And there's always the risk that somebody steals it. I think if you search the hugging face emoji, you're going to wind up in the right place. But walk me through the business today.
Give me an idea of the scale and the core business model. Yeah. So we're one of the most-used platforms for AI builders. We actually just crossed, today, 10 million AI builders using us. Congratulations, 10 million. That's 10 million AI builders.
So it's mostly AI scientists, AI engineers, software engineers, building models, training models, optimizing models, sharing models, datasets, apps. A new repository, a model, dataset, or app, is created on Hugging Face every 10 seconds now. Wow.
And the way we monetize, since you asked, is a pretty straightforward freemium model, where most of the usage and the users are free, and then a small percentage are premium, and usually they become premium with premium features.
So for example, enterprise features: when a big company like Google or Nvidia is using us, obviously they need user management, security, things like that. Or when they need premium compute, right?
So when they need more powerful GPUs or hardware to run some of the stuff they're doing on the platform. Yeah, 10 million AI developers: if you paid each one $100 million, which seems like the going rate these days, that's a quadrillion dollars. It's a lot of money.
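The host's back-of-the-envelope joke actually checks out. A quick sanity check of the arithmetic (the $100 million-per-researcher "going rate" is the host's tongue-in-cheek figure, not a real market price):

```python
# Tongue-in-cheek valuation from the conversation: 10 million developers
# priced at the (jokingly assumed) $100M-per-researcher going rate.
developers = 10_000_000
price_per_developer = 100_000_000  # dollars, the host's joke figure

total = developers * price_per_developer
print(f"${total:,}")            # $1,000,000,000,000,000
print(total == 10**15)          # one quadrillion -> True
```

10^7 times 10^8 is 10^15, which is indeed one quadrillion dollars.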
But even if you're paying reasonable rates, you're still up in the trillions of dollars for the total market. I would be interested to get your reaction.
How have you been processing the news that the talent market for artificial intelligence researchers seems to be hotter than ever in history, and we're getting into NBA money for top researchers? Do you think this makes sense given where we are in the cycle? What has been your take overall?
Yeah, it's definitely harder than it's ever been, because when the CEO of OpenAI is saying that another company is paying more to get OpenAI employees, that's really the top, because historically they've obviously been the highest-paying company ever in terms of packages.
So when he's saying that they're getting beaten on packages, it's quite phenomenal. You know, I hope it's not going to continue too long, and that there's going to be more of a democratization of the skills of AI building. Otherwise I think we'll end up in a quite weird world.
One of my biggest focuses is how to fight concentration of power, concentration of skills, concentration of resources in AI. And I hope we can progressively move into a world where everyone can build AI, and not just a few hundred AI scientists.
That'd be really great for me. What is your interaction today with the various prompt-to-code tools? What are you most excited about? What is that space? Because that feels like kind of the entry point.
Somebody comes in and they're making software for the first time, and that can maybe be a gateway into exploring the whole ecosystem of Hugging Face and everything there. It's super exciting, right, to empower more people to be builders.
We actually released last week our MCP server, which is integrated into ChatGPT, which they announced a few days ago, into Codex, into Cursor. So we integrated with all of these, with a specific focus, which is not only to empower people to build websites or simple apps, like the previous generation of apps, but to empower everyone to actually build AI models, right?
Which becomes really exciting, because it means that maybe everyone can start training and optimizing their own model, which they then use themselves in their interface to keep building even more. That's when you start to have this flywheel in terms of AI progress.
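MCP, the Model Context Protocol these tools speak, is JSON-RPC 2.0 under the hood. A minimal sketch of the `tools/call` request a client like Cursor or Codex would send to an MCP server; the `model_search` tool name and its arguments here are illustrative assumptions, not the Hugging Face server's confirmed tool list:

```python
import json

# A JSON-RPC 2.0 "tools/call" request, as defined by the MCP spec.
# The tool name and arguments below are hypothetical examples of what a
# Hub-backed MCP server might expose; check the real server's tool list.
request = {
    "jsonrpc": "2.0",
    "id": 1,
    "method": "tools/call",
    "params": {
        "name": "model_search",          # hypothetical tool name
        "arguments": {"query": "video generation", "limit": 5},
    },
}

# A client serializes this and sends it over the MCP transport
# (stdio or HTTP), then matches the response by the same "id".
payload = json.dumps(request)
decoded = json.loads(payload)
print(decoded["method"])  # tools/call
```

The point of the standard is exactly what's described above: one wire format lets every chat client call every tool server without bespoke integrations.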
So that's our focus with the integration of Hugging Face MCP with a lot of these tools. On the open-source question, the decentralization of power:
What was your interpretation of the Anthropic news today, that it was in fact fair use for them to train their models on, it was roughly 7 million books?
I was optimistic that that would happen, but it also seemed like we clearly needed some sort of legal ruling here to understand this stuff. But how did you process it? Were you happy with the result? Yeah, I think this is good news.
Obviously, I think these rulings will come and be different and specific depending on the use case, because you have to look not only at what's been used but also at how it's used: whether it's transformational, whether it's replacing or competing with the initial dataset. Yeah.
What's been cool for open source is that I think, by default, open source is fair use. So when someone releases a dataset on Hugging Face, in my opinion it's usually fair use, because it's used for education, for progress.
If you look at when copyright was invented and designed, the main focus was not to prevent progress and learning and education, right? You don't want copyright rules that limit progress.
And so I think in open source, most of the time, when you release models or datasets in open source, it's fair use.
So hopefully that's going to be a bit more motivation for companies, especially in the US, who've used this argument a little bit to not release their models and datasets, to do it a little bit more, because I think we really need it now.
I don't know if you've seen, yesterday, the talk from Dylan from SemiAnalysis about the energy limitations in the US compared to China. Yeah.
Something interesting there is that, in addition to that, we're mutualizing compute way less in the US than in China, because in the US most of the leading frontier labs are proprietary, and so they all do the same training runs. Oh, interesting. If you think of it, right?
If you think of Anthropic, OpenAI, xAI, they all do almost exactly the same multimillion-dollar training runs. Yeah. Whereas in China, they're much more open, right? DeepSeek, for example. And so they mutualize compute and energy much more.
So it's not only that they have more capacity, but also that they use this capacity much better. So I think it's urgent that in the US we find solutions for that, so as not to create additional risks for the development of AI in the US. Yeah. Yeah.
It's oddly become more monopolistic over there, or something like that. It's kind of an odd outcome, but that certainly makes a lot of sense. On the training datasets that go out: what are you seeing on the video frontier?
It feels like we've been batting back and forth this idea that Google might have a really powerful, sustaining advantage there because of the YouTube dataset. You know, you talk about 7 million books; you can probably download that from a torrent somewhere if you're creative enough.
You might be able to scrape it onto a single hard drive. We've heard about folks putting training data on hard drives, flying to Malaysia, doing a training run, flying back. You think about GitHub: all the code's basically been stolen at this point. Like, it's out there.
Some of it's open source; some of it's, you know, obtainable. It's much harder to get all of YouTube on a single hard drive and steal it, or even just have a scraper, because hundreds of hours are getting uploaded every minute or something like that.
So what are you seeing on the video training side, open source and closed source? How do you think that submarket plays out in artificial intelligence? An interesting data point there is that on Hugging Face, there are almost half a million open datasets, with a thousand new datasets added every day.
And the fastest-growing category is video datasets. And the reason why is not only that it's more of the focus for training, but also, I think, that we're starting to see more and more synthetic video datasets that are starting to really work. Yep.
Especially because the physics is starting to work in a lot of these video generation tools.
It's not only used to create other video generation models; it's also starting to be used, for example, in robotics, where we're also starting to see quite strong growth, to almost use the physics of the video as synthetic training data to train robots, right?
I think it was Musk who said, a few weeks ago, that in the future we'll basically put a robot in front of a laptop watching YouTube videos. Maybe it could watch shows like yours and basically learn from that.
So it's exciting that this intersection of video and robotics might lead to some interesting results in the future. They might be watching already. Who knows? On the video, I want to stay on the video training side.
What makes for a great open-source video training dataset? I'm interested to hear how you processed the news of Meta acquiring Scale AI, and the idea of the human defining what good data looks like. I'm familiar with what that process looks like in the RLHF world, in the LLM world.
You know, you're basically creating rubrics for grading answers to clear questions. Does this follow the right format? And then you might have the customer or the user give a thumbs up or thumbs down: did this answer my question?
In the video context, are you having a human watch a video and then tag it with text, or is there other metadata that's important? The physics thing seems harder to define. Basically, my question is just: what makes for a great video dataset? I think it's still an open question.
I think nobody really knows, because it's quite early.
And at the early days of cycles like that, you usually don't care so much about the quality of the data, but the quantity of the data, right? So I would say that what makes a great video dataset today is its size, right? It's for it to be big. But progressively, similarly to what we've seen in text, as you start to see more specific use cases and specialized models, when you start to hit some sort of wall on the data, you start to focus more on the quality of the dataset, and then we'll learn how to make them better.
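The quantity-first, quality-later progression he describes can be sketched in a few lines. A toy illustration (the clip records and the `motion_score` quality proxy are invented for the example; real video-dataset curation uses far richer signals than a single score):

```python
# Toy illustration of the two curation phases described above.
# Each record stands in for a video clip; "motion_score" is an invented
# proxy for how physically plausible / informative the clip is.
clips = [
    {"id": i, "seconds": 10, "motion_score": (i * 37) % 100 / 100}
    for i in range(10_000)
]

# Phase 1 (early in the cycle): size is the metric, so take everything.
phase1 = clips
hours_phase1 = sum(c["seconds"] for c in phase1) / 3600

# Phase 2 (hitting the data wall): filter aggressively on quality,
# keeping a much smaller but higher-signal subset.
phase2 = [c for c in clips if c["motion_score"] >= 0.8]
hours_phase2 = sum(c["seconds"] for c in phase2) / 3600

print(len(phase1), round(hours_phase1, 1))  # 10000 27.8
print(len(phase2), round(hours_phase2, 1))  # 2000 5.6
```

The filtered set is a fifth of the size; the bet in phase 2 is that those hours are worth more per unit of compute than the raw pile.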
One thing that we're seeing on Hugging Face which is maybe not controversial but counterintuitive is that it's not going to be just one dataset or one model, or even a dozen datasets and a dozen models.
There are going to be millions of datasets and millions of models, just the same way you think about GitHub repositories and code repositories, where there isn't really one GitHub repository to rule them all, right?
Every company, every use case has its own specialized, customized code repository. One thing that we believe is that ultimately there are going to be millions of different models and millions of different datasets, and every single use case is going to have something optimized and customized for it.
Yeah, I feel like there's a little bit of tension there, though: if scale matters on the dataset side, how are you possibly betting on thousands of small open-source datasets versus Instagram, YouTube?
It just feels like, when I think about who's really going to dominate the future of generative video, it's got to be the platforms that are ingesting every image and every video, all the time, forever. Yeah, in my mind it's a timing thing.
At the beginning, when we're starting on a cycle like video, we almost have to brute-force our way into intelligence, and so, you know, quantity matters. Yep.
But progressively, as you want to optimize more, for example when you want to optimize for cost, you start thinking: okay, how can I train a smaller model that is going to be faster, that is going to cost less money? Because, for example, if I want to do a banking customer-support chatbot, I probably don't need it to tell me about the meaning of life, right? I just want it to tell me about my bank account. And so I can use a smaller, more customized model. So I think you'll see both phases in video, a little bit the same way that you've seen both phases in text, where you started with the biggest models, and now when OpenAI or Anthropic release a model, they don't really talk about the size of their models anymore. And I wouldn't be surprised if, under the hood, the size is actually going down, because I think at some point you start to want something optimized and customized, and that's when you start to see more models, more diversity, also in the datasets.
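The banking-chatbot example above is at bottom a routing decision: match the query against the narrow domain the small model covers, and only escalate out-of-scope questions to an expensive general model. A stdlib sketch (the keyword set and model names are placeholders; a production router would use a trained intent classifier, not keyword matching):

```python
# Minimal sketch of the cost-driven routing idea: a small, specialized
# model handles the narrow banking domain; anything else escalates.
# Keyword overlap stands in for a real intent classifier.
BANKING_TOPICS = {"balance", "account", "transfer", "card", "statement"}

def route(query: str) -> str:
    """Return which (hypothetical) model should answer this query."""
    words = set(query.lower().split())
    if words & BANKING_TOPICS:
        return "small-banking-model"   # cheap, fast, customized
    return "large-general-model"       # expensive general fallback

print(route("Check my account please"))        # small-banking-model
print(route("What is the meaning of life?"))   # large-general-model
```

The economics follow directly: the more queries the small model absorbs, the lower the blended inference cost, which is exactly the pressure he says pushes model sizes down over time.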
Yeah, this is the information-efficiency thing. A human doesn't need a trillion hours of training to learn how to speak English; a kid can learn that in a couple of years. And so the algorithms eventually will get there. Jordi, do you have a question?
George Hotz was on the show last week talking about how he's been seeing venture-backed founders want to just check the open-source box and basically say, like, "Oh yeah, we're open source," just because that's the cool thing to be. Are you seeing that too? Where do you think the line is?
Are you a beneficiary of that? Yeah, we're seeing that, even though I think there's still a way to go, especially for big US tech companies.
So I've been in AI for eight years now, and if you look at the cycle from 2016 to 2021, 2022, the big tech companies in the US were doing so much more open science and open source, and in many ways that's how the US got the leadership. You know, Google releasing transformers, and then the T of GPT, the transformer, becoming ChatGPT, everyone building on top of each other. That's how you accelerate progress.
It has definitely slowed down if you look at the big tech companies in the US. But fortunately, I think startups are compensating and filling the void in a way, and I hope that big AI companies in the US will also change a little bit, evolve a little bit.
OpenAI, obviously: Sam Altman said that they would release an open-source model at some point, so I'm excited about that. What are your expectations for that model? Do you have any predictions?
My expectations are quite high, just because of the history of OpenAI, right? When they release something, they tend to release something quite good. So I hope, and I suspect, that they could release something quite transformational in open source.
Hopefully, you know, I've always said that if we had the equivalent of a DeepSeek but in the US, it could be 10 times bigger than DeepSeek. So hopefully, if we can have something 10 times bigger and more impactful than DeepSeek, it would be fun, I guess. Is that not Meta Llama?
How do you describe what's going on at Meta right now? Yeah. Well, I mean, it's easy to dunk on Meta, right? But I think they're the most open big technology company in the US right now. They really changed the field with Llama.
They gave a tremendous boost to the open-source community, and they still share a lot of their work in open source. I think open source is what brought them to the frontier, right?
Before they started to release in open source, they were quite far behind in the AI race, and now they're very much in the race, which is great. The fact that they haven't released something massive as a follow-up yet, to me, shows how hard it is to build at the frontier, right?
It's not easy, even for a big technology company like Meta. And so I think that speaks to that. I'm excited to see all their new efforts, all the resources they're investing in AI right now, and I hope that they can keep sharing with the community and open science and open source.
What was your reaction to Apple's announcements at WWDC around on-device intelligence, or inference? Was that exciting to you? Do you think that's going to be a catalyst for more developer activity, and what are you seeing so far? I think it's quite early.
But I'm quite excited about on-device. I suspect that maybe you'll have a higher percentage of compute on device for AI than you did for software, just because of some of the need for speed, for privacy,
and the constraints in terms of cost. If you think of a ChatGPT on device, what's amazing is that it's totally free, right, compared to the really high cost that ChatGPT has right now, not only for the customers but also for OpenAI to run. It's totally private, in the sense that you can say anything and nobody's going to see what you're saying. And it could potentially be quite fast.
So I'm excited about on-device. It's early, especially on the technology side. I think there aren't a lot of devices that really make it okay to run some of the great models. But it's progressing really fast, and I can't wait to see what happens in this domain in the next few months.
That's a pretty huge switch, because it's basically 0% on device right now for AI inference. And if you're predicting that it's going to go beyond 50%, that's a huge shift. I mean, I have a follow-up. I'm excited about robotics too.
It's a little bit of a segue from on-device, but we organized, a week ago, what has turned out to be the biggest hackathon for open robotics, with thousands of people participating from over 100 different locations, building open-source robots.
And we're definitely seeing something happening there: the conjunction of cheap hardware, open source, plus new capabilities for AI could be the perfect combination for some sort of ChatGPT moment for robotics.
Do you think it's a straight shot to humanoids, or do you think we'll see a Cambrian explosion of Nat Friedman-style bots that, you know, will pick up a single leaf at a time and are more use-case-specific? Are we going straight to generalizable? Because that was the ChatGPT moment.
I feel like there were great machine learning models for ad inference and recommendation algorithms. Like, we had AI, but it was very narrow. ChatGPT was very broad. Which one do you think is coming first, and what are the relative timelines? It's a good question. I'm not sure, to be honest.
I think it's still undecided. Yeah. I hope it's more a future of diverse robots, diverse models.
But I'm obviously biased on this, because I think if you only have one type of black-box robot in your home, and it's in millions of houses, that's kind of a scary world. But it's hard to tell. It's hard to tell at the moment, I think.
What are key breakthroughs that you're looking for this year? Any predictions, new catalysts, that kind of thing? You have a crystal ball in the office. I wish I had a crystal ball. I'm kind of bored with text and chatbots. Sure.
Right now, I think there are a lot of people working on it, and we've reached this point where it's a very... Don't say plateau. Don't say plateau. No, not a plateau, but incremental, very incremental improvements that are a little bit boring. So I'm much more excited...
We can still invest billions of dollars in it, right? Right. It's fine. We can definitely still deploy. Yeah. So I'm more excited about harder domains, like you were talking about biotech just before. Yeah. Biology, chemistry. These domains are super exciting.
Today, I don't know if you've seen, Arc Institute released a cell perturbation prediction model on Hugging Face and on GitHub, which I'm super excited about.
Obviously, if you can predict the perturbation and the evolution of cells, especially how they react to drugs and things like that, it can have quite a big impact in terms of drug design and things like that.
So I'm really excited about these kinds of things: more biology, chemistry, and how to apply AI there. How much economic value do agents create internally at Hugging Face on a monthly basis? Quite a lot.
So we have this thing called Spaces, which is our kind of AI app store, where there are over half a million open AI apps that people are building. And it's integrated with our MCP framework, so you can add them to your chatbot.
And that's kind of the most-used thing internally at Hugging Face, where people are going to use Cursor or ChatGPT and call some of these more specialized AI apps to do specialized tasks for them.
It's hard to put a number on it, but yeah, it's quite transformational, of course.
Do you find it kind of fascinating that you guys are getting a lot of value out of agents internally? What do you think agent adoption will look like in businesses? Because, I don't know how rapid it feels right now.
It's rapid in the bubble that we're in. People are trying a lot of stuff, but maybe they're churning quickly as well. So I'm curious what you think adoption will look like. Well, I mean, I think ultimately agents and AI are similar evolutions of the same trend, right?
And I think in terms of user interface, they're actually going to merge, and I'm not sure there are going to be so many different ways of interacting with AI and agents.
But they're definitely going to go mainstream, just because of the very nature of a product like ChatGPT that has already gone mainstream, and where I'm sure OpenAI will bring more and more agentic workflows into it. So yeah, I think it's definitely going to go mainstream faster than anything we've seen before.
Now the question will be: do you use a very complex agentic workflow 1% of the time, for 1% of your queries, or for 10%, or 50%, or 90%? And I think we'll see that based on the development of the technology and the capabilities. If the agents are much better than a series of queries, then I'll use that; and if not, I'll stay with my past way of doing simple queries.
Totally. Last question for me. Now that Meta owns something like 49% of Scale AI, the budget for data generation has to be significant at Meta. Is there a specific dataset that you'd like to see Meta harness Scale AI to produce and then open-source?
I mean, I think biology and chemistry datasets are still very much lacking. Is there a more specific example within biology or chemistry? Like, what would you actually use an army of humans to go and categorize?
Would you need biologists, and like PhD, postgrad-type work, or are we talking about something that someone could do even with just a basic skill set? I think it's an open question. I don't really know, to be honest.
I think if we knew what the ideal dataset there would be, we would build it ourselves, because we also build some datasets ourselves. That's certainly the story of AlphaFold.
They had a fantastic dataset for AlphaFold, and then they were able to do reinforcement learning against it, and that's what really solved that incredibly hard challenge. So if you don't have the data... Yeah. But my main... Yeah.
My main point would be, you know, maybe to focus a little bit less on just text and just chatbots, and to focus a little bit more on other domains.
I think that's when you're going to unleash a lot of the additional impact, and a lot of, I think, the positive use cases for the technology too. Totally. Well, thank you so much for joining. This was fun. Appreciate all your insight. Thanks for having me, and hopefully we'll have you on again soon.
Sounds good. We'll talk soon. Thanks so much. Let's go back to the timeline while we wait for our next guest to join. Unusual Whales says Deloitte's US employees can now buy $1,000 of Lego on the company's dime to boost their wellbeing. We're already doing that before Deloitte. We get what employee wellness looks like. Yes.
Challenges. It's not just policies in a PDF somewhere. It's not a gym membership. It's Legos on the table. Yes, Legos on the table. You're in a meeting, you're stressed out, just start making Legos, right? Oh, it looks like Tyler over on the intern cam has a few Lego sets that have been sent to us.
Our Anduril Lego set buildout was fantastically successful. Tyler did it in what, an hour and a half? Current record holder. I think it was 119. Who's counting, though? 119. Current record holder. And walk us through what companies sent you stuff. What will you be building next? Yeah.
So, first we have the Epirus one. This is Epirus. Got a little counter-drone system going there. That's awesome. This one looks like a lot of fun. And then the other one is from Solugen. Solugen. Very cool. It's a Bioforge molecule factory. Yeah.
Now, everybody should have a... We'll come back to you later, because our guest is here. But I want to know, because it seemed like Anduril went straight to the Lego factory and had a design document and guidelines. Those look like they don't even have instructions. So I want you to dig into those and see.
I want new estimates for how long you think that'll take you. Okay, cool. Good luck. Anyway, he's like, "Not another Lego set. I can't possibly." I thought that was good viral bait.