Chonkie: Open-source document chunking for LLM RAG pipelines — 180K downloads, 200 projects, and LlamaIndex dependency

Jun 11, 2025 · Full transcript · This transcript is auto-generated and may contain errors.

Featuring Trey

Fantastic. You guys come on when there's big game names. Games correspondent. I'm going to take this hat. Enjoy it. All right. See you guys. Let's bring in the next person. This guy 2025. Eat data. Make chunks. Make chunks. Welcome to the stream. I'm John. Nice to meet you. Nice to meet you, John.

Can you introduce yourself? My name is Trey. I am the co-founder of Chonkie. Chonkie. Chonkie. Chonkie. That's the name of the company. Yes. What do you do? We take really complex documents.

We split them up into meaningful pieces, such that one piece is one idea, and then we send your LLM only the data it needs to answer questions. Give me an example of a really complicated document. Financial reports. You've got graphs. You've got actual text data, paragraphs. You've got tables.

And if you're asking... Those are so annoying because there's, like, so much boilerplate. You need to just skip to the right thing. Exactly. Yes. Exactly. And most of the time, when you're asking questions to an LLM, you really only need one table, or maybe you need a summary. That's it.
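
As a rough illustration of the idea described above, here is a minimal Python sketch that groups paragraphs into chunks under a word budget and then picks the single chunk that best matches a question. This is not Chonkie's API: the function names (chunk_paragraphs, best_chunk), the max_words parameter, the sample report text, and the word-overlap scoring are all assumptions made for illustration.

```python
import re


def chunk_paragraphs(text: str, max_words: int = 50) -> list[str]:
    """Group blank-line-separated paragraphs into chunks of at most max_words words.

    A single paragraph longer than max_words is kept whole rather than split.
    """
    paragraphs = [p.strip() for p in text.split("\n\n") if p.strip()]
    chunks: list[str] = []
    current: list[str] = []
    current_words = 0
    for para in paragraphs:
        n = len(para.split())
        # Close the current chunk if adding this paragraph would exceed the budget.
        if current and current_words + n > max_words:
            chunks.append("\n\n".join(current))
            current, current_words = [], 0
        current.append(para)
        current_words += n
    if current:
        chunks.append("\n\n".join(current))
    return chunks


def tokenize(text: str) -> set[str]:
    """Lowercase the text and return its set of alphanumeric words."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))


def best_chunk(question: str, chunks: list[str]) -> str:
    """Return the chunk sharing the most words with the question (hypothetical scoring)."""
    q = tokenize(question)
    return max(chunks, key=lambda c: len(q & tokenize(c)))


# Hypothetical three-paragraph "financial report".
report = (
    "Revenue grew 12% year over year, driven by subscriptions.\n\n"
    "Headcount rose to 340 employees across three offices.\n\n"
    "This report contains forward-looking statements subject to risk."
)

chunks = chunk_paragraphs(report, max_words=10)
context = best_chunk("How many employees are there?", chunks)
print(context)  # only this chunk, not the whole report, would go to the LLM
```

Word overlap stands in here for whatever relevance measure a real pipeline would use; the point is only that the model receives one focused piece instead of the full document.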

Why don't I just throw all of that in a big context window, Gemini, 1 million tokens or something, and then just ask it? It'll work with, like, maybe one PDF, but you have a whole database of PDFs. You've got, like, 100-page PDFs, thousands of those. You've got schematics, which are really complex. Models get confused. We actually ran this eval yesterday, after the price drop on o3: we took relatively simple documents, we took classic literature, you know, David Copperfield, L..., everything like that, and we gave that to o3. We asked very pointed questions. o3 got a retrieval accuracy of 75%.

Great. We chunked the data through Chonkie. Then we asked o3 the same thing. Always 100%. Always chunk. This is my favorite name since last YC batch, which was a company called Pig. Yeah. You... I'm setting a trend here. You just like large animals. I mean, it's just... it's going to stick.

We're going to be talking about this next demo day. I'm talking about the pipeline. I have a bunch of huge PDFs on S3 or something. I feed them into your system. Am I getting a Postgres table? Am I getting a MongoDB-like unstructured... am I getting embeddings waiting?

So it's like a vector database? Yeah. So you get embeddings out there. You can put them in your own vector database, or we can also wrap around your vector database. That's totally up to you. It's really developer-friendly.
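
To make the "embeddings into a vector database" step concrete, here is a small Python sketch of the general pattern: embed each chunk, store the vectors, and retrieve by cosine similarity. Everything here is a stand-in and not Chonkie's API: toy_embed is a hashed bag-of-words rather than a learned embedding model, InMemoryVectorStore is a hypothetical class in place of a real vector database, and the sample chunks are invented.

```python
import math
import re
import zlib

DIM = 1024  # size of the toy embedding space (an arbitrary choice)


def toy_embed(text: str) -> list[float]:
    """Hash each word into one of DIM buckets, then L2-normalize the counts.

    A stand-in for a real embedding model; it only captures word overlap.
    """
    vec = [0.0] * DIM
    for word in re.findall(r"[a-z0-9]+", text.lower()):
        # crc32 is used instead of hash() so results are stable across runs.
        vec[zlib.crc32(word.encode("utf-8")) % DIM] += 1.0
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else vec


class InMemoryVectorStore:
    """Hypothetical minimal vector store: a list of (embedding, chunk) rows."""

    def __init__(self) -> None:
        self.rows: list[tuple[list[float], str]] = []

    def add(self, chunk: str) -> None:
        self.rows.append((toy_embed(chunk), chunk))

    def search(self, query: str, k: int = 1) -> list[str]:
        """Return the k chunks whose embeddings have the highest cosine similarity."""
        q = toy_embed(query)

        def score(row: tuple[list[float], str]) -> float:
            # Vectors are unit length, so the dot product is the cosine similarity.
            return sum(a * b for a, b in zip(q, row[0]))

        ranked = sorted(self.rows, key=score, reverse=True)
        return [chunk for _, chunk in ranked[:k]]


store = InMemoryVectorStore()
sample_chunks = [
    "Headcount rose to 340 employees across three offices.",
    "Revenue grew 12% year over year, driven by subscriptions.",
    "This report contains forward-looking statements subject to risk.",
]
for chunk in sample_chunks:
    store.add(chunk)

print(store.search("How many employees are there?", k=1))
```

In a real setup the embedding function and the store would be swapped for an actual model and database; the interface shape (add chunks, search by query) is the part this sketch is meant to show.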

The idea is to just make a dev tool that people just enjoy using, and they can have it be two lines of code, five lines of code, whatever it is. So what's actually happening with... it's not open source, right? It is open source. It is open source. We have an open strategy, and so we are, like, open source first.

We started as a side project in open source, and we love open source. So is this something that I should be running in, like, an ingest process? As I'm generating new large documents, I'm chunking them and then loading them into my vector database, which I'm maybe also hosting? That's an async chunk.

But if you're building a codegen tool, then you want a live chunk. A live chunk. Yeah. Okay. And so if you're doing, like, codegen on the fly, or if you're working with a corpus that's changing all the time, then you do want to do it live.

You came into YC with this idea. You already had it as an open source project? Yes. Yes. We had an open source project all set up in, like, February, and we came into YC with this idea. How many people on the scene? It's just me and my friend from seventh grade.

Just so many boys' groups made it out of the group chat. And then we're not... Chonkie made it out of the... How many chunks have you chunked? How many stars do you have on GitHub? How much revenue are you making? What do you got for us on the quantitative metrics side?

So the metrics I really like are: we've got over 180,000 downloads, and we've got over 200 projects using us. We're a core dependency of projects like LlamaIndex. Oh, cool. And we've got, like, 10 to 12 batch companies using us. Good inbound coming in from there on. Fantastic. Round's already done? What was that?

Round is almost done. We're trying to wrap it up this week. Preliminary. Yes. Just weighing our options and making sure by Friday it'll be done. Fantastic. Amazing. Well, good luck out there. Well, thank you so much. Thank you for having me. Never stop chunking. Never stop chunking. We'll be following.

What's the domain? C-H-O-N-K-I-E dot ai. And if you want this merch, it's shop.chonkie.ai. He's selling merch. He's selling merch. Let's bring in the next... Great to meet you. The next participant of the demo day stream, if we have one. Welcome to demo day 2025. We are live from Y Combinator in San Francisco.

Great to meet you. Nice to meet you. I'm John. Nice to meet