DataCurve's 18-year-old founder raises $15M Series A to supply high-skill coding data to frontier AI labs
Oct 13, 2025 · Full transcript · This transcript is auto-generated and may contain errors.
Featuring Serena Ge
about the work being rewarding. But um, it is kind of like the hammer that's there that you can hit, and then all of a sudden your country is in the black again if you're losing money. Yeah. Anyway, we have our third guest of the show, Serena from DataCurve, in the Restream waiting room.
Welcome to the show, Serena. Thank you so much for joining us. How are you doing? Good. How are you guys doing? We're doing fantastic. Would you mind introducing yourself and the company, and then we'll get started? Yeah, of course. Yeah, DataCurve.
We do coding data for the foundation model labs, to develop new capabilities or improve them. I dropped out of school at 18 to do YC for some shitty-ass idea called Uncle GPT that eventually turned into DataCurve. Wait, what was it called? It was called Uncle GPT. We were building, like, a middle-aged AGI.
It didn't work. That's amazing. AGI achieved internally. Wait, was it all just like a prompt, basically, wrapped around... Was it a classic GPT wrapper? It was some wrapper.
We were making web agents that would navigate the web, but it wasn't working, and we pivoted throughout the batch. Would they keep having midlife crises and go try to buy a 911 in the middle of a task? Yeah, it wasn't working well.
But yeah, DataCurve's been around for a year since, and this is working. What's the evidence? Have you just been getting a lot of customers? Has there been really solid progress on the product side? Like, what's the secret to unlocking the news today?
I think since day one of starting DataCurve, we found a lot of traction commercially, where a lot of the foundation model labs were inbounding to us. Like, imagine me, I'm some 19-year-old who has never taken a sales call, and the first one is a foundation model lab.
We saw so much commercial traction in the market, and we've always been pushed to fulfill that more and more, which brings us to today. We're fulfilling it on a platform that's paid out a million dollars in bounties, and we just raised our Series A. How much did you raise?
We raised $15 million in the Series A, which brings it to $17 million total. Congratulations. Amazing. You said that the focus is on coding data. Break down the niche a bit more. Well, I think coding data is not just any niche. It's like the niche, the thing that's working for LLMs.
The thing that's working commercially. So when we think about coding, in the whole LLM space it's like the most important capability. We do all the post-training data for improving coding agents or existing capabilities, like anything that you see in the SOTA models.
We're looking to improve that, or maybe even develop new capabilities with the labs.
What, with coding, is a place where you actually need to go get new data? Because I imagine there are a million answers to FizzBuzz on Stack Overflow, that's all been scraped, all of GitHub has been scraped, and obviously the labs and the Cursors and Windsurfs of the world have this RL data flywheel now. But where would you actually say, I need to go get new data to improve a coding agent? Yeah, let's think about how this whole cycle starts, right? Let's say we look at the current coding agents: they can solve two-hour-long issues.
But what if we want them to solve seven-hour-long, or even more long-horizon, tasks? Perhaps for us at DataCurve, we want to provide reinforcement-learning-with-verifiable-rewards tasks, so that these models can learn on longer-horizon tasks with verifiers. That might be one thing. Or maybe people want to develop more multimodal coding agents, so maybe we want to do some data science RL or SFT tasks for the labs. Okay, so one piece of data might be, like, a really clean example of a seven-hour task, that then can be used to train a model that could do seven-hour tasks regularly, basically.
Is that the idea? It could be. That would be for supervised learning, but you can also have the verifiers for the seven-hour-long tasks, or intermediate verifiers, however the labs want to train that. Yeah. I mean, this was framed as taking on Scale AI.
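The verifiable-rewards idea discussed above can be sketched minimally: a coding task ships with a programmatic verifier (here, a test script), and the reward is simply whether the model's candidate solution passes it. This is a hypothetical illustration of the general RLVR pattern, not DataCurve's actual pipeline; all names and the task schema are made up.

```python
import os
import subprocess
import sys
import tempfile

def verifiable_reward(solution_code: str, test_code: str) -> float:
    """Binary reward for RL with verifiable rewards: 1.0 if the task's
    test script passes against the model's solution, else 0.0.
    (Sketch only; a real pipeline would sandbox execution.)"""
    with tempfile.TemporaryDirectory() as tmp:
        with open(os.path.join(tmp, "solution.py"), "w") as f:
            f.write(solution_code)
        with open(os.path.join(tmp, "run_tests.py"), "w") as f:
            f.write(test_code)
        try:
            result = subprocess.run(
                [sys.executable, "run_tests.py"],
                cwd=tmp, capture_output=True, timeout=30,
            )
            return 1.0 if result.returncode == 0 else 0.0
        except subprocess.TimeoutExpired:
            return 0.0

# Tiny example task: the verifier imports the solution and asserts on it.
TESTS = "from solution import add\nassert add(2, 3) == 5\n"
print(verifiable_reward("def add(a, b):\n    return a + b\n", TESTS))  # 1.0
print(verifiable_reward("def add(a, b):\n    return a - b\n", TESTS))  # 0.0
```

For long-horizon tasks, the same shape extends to the "intermediate verifiers" mentioned above: several checkpoints, each with its own test script, rather than a single end-of-task check.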
A lot of people think of Scale AI as almost Mechanical Turk: very basic, short-time-horizon tasks, where anyone could go do a Scale AI task. That's at least the narrative. And now we're in a world where there are companies doing expert sourcing, like, "we're looking for paleontologists to teach the LLM everything about dinosaurs," or whatever you said. And then there are also the folks designing RL environments with verifiable rewards, with no humans in the loop at all.
Where do you sit on that continuum? We definitely sit on the high-skill end.
We are not in the long tail. Like, we're not going to find you a urologist living in Egypt, but we're going to find the bulk of where the good coders are, and I think that captures most of the important tasks they're able to do and that the labs want to train on.
Yeah, it's definitely very high-skill, and I think the premise here is: I don't think any high-skill software engineer wants to be a data annotator. So you have to, let's say, pay them a lot on contract to get them to even agree to that.
But maybe they're not going to do a great job if you got them to go on contract, too. Especially at scale, you just get a bunch of Googlers coasting at their jobs while you pay them $200 an hour. So we do a bounty-based system for these very cracked people. They're having fun.
They're also making money, and they're also upskilling at something they're passionate about. And it's definitely very high-skill human data.
We heard that one of these firms that gets human-labeled data basically runs the entire company on a Slack channel, and they've created a lot of value and scaled a lot, but they just don't even need to build a product yet.
How important do you think it is to actually engineer a durable product, versus just solving the matching problem: labs want this type of person, I go get that type of person, I just introduce them? I think there's a very big component of pipelining, and also the experience that retains people.
So the more complex data you need, the more guardrails you need for someone who's well-intentioned and wants to do a task. You want them to be handheld throughout that process, and also to make that process engaging, especially if we're training somebody up to do a seven-hour-long task.
You'd better have a good experience or they will drop off. And then you also need different validations here and there to steer them. So pipeline and developer experience are something we focus a lot on. Yeah, I think pipeline is very important.
The whole experience is important, which is why we use gamification and also build better dev tools. If you just match people, then it relies on the labs having that pipeline internally. But for us, we do that all in-house, and we sell the final data, not the labor.
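The "validations here and there" described above could look something like lightweight automated checks that steer a contributor before human review. The schema and the specific checks below are entirely hypothetical, just to illustrate the guardrail idea; they are not DataCurve's actual rules.

```python
from dataclasses import dataclass

@dataclass
class Submission:
    """A contributor's bounty submission (hypothetical schema)."""
    patch: str           # the code diff they produced
    tests: str           # contributor-written verifier tests
    hours_logged: float  # self-reported time on the task

def validation_checks(sub: Submission) -> list[str]:
    """Return a list of problems to surface to the contributor;
    an empty list means the submission can proceed to review."""
    problems = []
    if not sub.patch.strip():
        problems.append("empty patch")
    if not sub.tests.strip():
        problems.append("no verifier tests attached")
    if len(sub.patch.splitlines()) < 3:
        problems.append("patch suspiciously small for a long-horizon task")
    if sub.hours_logged > 12:
        problems.append("time log exceeds task budget; flag for review")
    return problems

sub = Submission(patch="", tests="assert True", hours_logged=2.0)
print(validation_checks(sub))
# → ['empty patch', 'patch suspiciously small for a long-horizon task']
```

The design point is that each failed check gives immediate, actionable feedback, which is cheaper than rejecting a finished seven-hour task after the fact.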
You said you got into YC at 18, is that correct? Yeah, like, right around my birthday, turning 19. Fantastic. Do you remember what your answer to the question about "what's a real-world system that you've hacked" was at the time? That was like three years ago. Um, yeah. What was that?
Oh, I think it might have been that I skipped high school and got perfect grades or something. You skipped high school? Oh, you mean you just, like, didn't go? Yeah, because it was COVID, so I just recorded everything and played it back at 4x speed, and didn't really do much else. 4x is so cracked. This is who you're competing against.
It's amazing. Yeah, you've taken, like, the worst critique of Zoomer brain rot, but applied it just to, like...