Monogram launches voice-first AI app with visual UI output, raises $40M from DST and Lux Capital

Jun 30, 2026 · Full transcript · This transcript is auto-generated and may contain errors.

that will accrue to the labs or the biotech companies. Who knows? We'll figure it out. Maybe both, maybe everyone. Hopefully, lots of consumer surplus. But we have Eren Bali from Monogram in the waiting room. Let's bring him in to the TBPN Ultradome. Eren, how are you doing? Welcome.

Thanks, guys. It's great to be here with you guys.

Fantastic. Great to have you. Finally. Long overdue. Talk about the launch today. Talk about the fundraise. Introduce yourself a little bit. Tell us about Monogram.

Okay. So I'll try to do it in a different order.

Whatever you want.

I guess most people know me from the two other companies I started. Udemy, the online education platform,

went public. Hopefully most people know it by name. I started Carbon Health.

Yeah.

A big tech-enabled healthcare company.

And now the new company, Monogram. We have been very stealthy about it, intentionally.

Yeah. Yeah. Essentially, we launched today, just, like, five minutes ago.

Congratulations!

Thank you. It's a general-purpose...

So loud. Congratulations.

Sorry, now we can continue.

Okay, so...

Jordi's on one today. Just continue.

Yeah, it's a general-purpose AI application for everyday use. Yes.

But what makes it different is

With other AI applications, you usually have a chat-based interface.

Yes.

You ask a question,

you write some text, you get some text back.

Yes.

With Monogram, you ask something and then you get a complete visual interface back.

Yes. It's essentially like when you think about how computers had command-line interfaces back in the '80s, and then we switched to graphical user interfaces, where everything became visual and interactive. We will be the first company to do this in AI.

I love it. I've been seeing glimpses of this with ChatGPT images. I wanted to know the rate of air conditioning in every European country, because there's this whole debate over France: they don't have air conditioning there. I wanted to know how France stacks up. And instead of asking for a text output, I just said, "Go get the data and then generate me an image, a graphic of that." Of course, it's not interactive. It's not HTML-based. It could be so much richer. And people are seeing glimpses of this with vibe coding, but it hasn't come to the consumer. So, I'm interested to know: have you found a beachhead? Have you found a killer use case? Like, I feel like...

Yeah. I guess, assuming every single person listening uses AI 20 times a day, you know, even if it's just casually asking questions, etc.

What are the kinds of prompts that you would traditionally take to

a normal language model that people should try with Monogram?

Yeah.

Yeah. So, first of all, this is one of those things where, when you see it, it becomes very obvious.

Yeah.

But it's sometimes hard to describe. The big difference is we generate a visual interface response in roughly one and a half seconds.

So you ask a question and you immediately start seeing something, right? So it is not "do this work, do some research, and get back to me with a dashboard or report." It is: I ask a question, I lift my finger from the audio button,

and then, in about a second, you start seeing the...

That's really cool. So, to be honest, we didn't know which use cases would work best with this technology.

But what we realized, testing the application ourselves, is that it works best for very, very simple day-to-day use cases. For example, I say, "Okay, the Oscars are live, so which movies were nominated for Oscars this year?" Right? "Can I find something to watch, maybe something from a South American director?" Just really open-ended questions, like "Where do I eat? Is there an In-N-Out on my way to the airport from here?"

Yeah.

Right. So all these day-to-day questions. And then what you realize is, yes, if you ask the same question to ChatGPT, it will also answer, right? Behind the scenes, we are using an open model.

Mhm. But we do quite a bit on top of it to make this response very interactive. You get information architecture, right? You get a lot of interesting visuals, and when you tap on something, you get even more information. So it's really just day-to-day search, discovery, basic questions. That's where we actually start seeing the most, kind of, enjo...

Yes.

And talk about that. I mean, John is such a power user of voice.

I was a laggard, but now I don't even like to prompt without voice, because I can get a much better result if I just talk for

20 seconds and give it a bunch of context, and then it'll give me back, you know, something closer...

I'll answer that from first principles, right? I'm a first-principles person. So talking is a lot faster than typing, right? Roughly four or five times more information can be conveyed with audio per second.

But the problem with voice is, when you do voice mode, you have to listen. Listening is very slow. Listening is maybe four times slower than reading.

Ah, that's interesting.

But if you think about it, there's something even faster than reading. That's actually seeing. Mm.

So the fastest, the most, I would say, frictionless experience for me with AI is voice in and then visual output back. Yeah.

Except when you have visual output, sometimes you can tap on things, right? Tapping on a card is a lot faster than saying, "Can you get me more details about this thing?" Right? So...

So we support every form factor, but we design everything around audio as the primary input. Camera will be the secondary, and keyboard will be kind of a third input. They all work. But in the output, we do have some audio output, like a short answer, but maybe 90% of the information is conveyed through the visual interface, and you're getting a short answer back with, kind of, text-to-speech.

Yeah. So cool.

I wish I could bet on how quickly Apple tries to buy you,

because I'd be like, term sheet by end of week. So yeah, I mean, this is your third huge company. I guess it's like, where do you want to go with this? What are your aspirations? Because you've checked a lot of the boxes on the entrepreneurship journey. But then I'm also interested in this: you're not building the same business three times in a row. There are certain founders that do that very successfully, and it's impressive. But between Udemy, Carbon, and now this, these are very different companies. So what lessons, what strategies are you actually taking through the three companies, where you feel like, "Oh yeah, you might not think it, but I'm running the Carbon playbook in this particular area," or "I'm running the Udemy playbook here a little bit"?

Mhm. So the common thing between those three companies was that what my brain gets obsessed with is: how do I take something that is complicated and has a high entry barrier and make that accessible to more people?

Yeah.

With Udemy, when everybody was focused on higher education and online degrees, we focused on "how do I learn Photoshop in a week?" So it was the most mainstream of the education companies, and that's where we got a ton of success. The difficult part was not being overly caught up with what was hot in Silicon Valley at that time. With Carbon Health, when everybody was going after premium concierge healthcare, we were more focused on how we make the healthcare that everybody can access better. And I think here, right now, all the focus in AI has been on coding agents and how to make really complex enterprise workflows work. But what I'm really obsessed with is this: if I make AI a little bit easier to use, it will just benefit more people.

So again, there's not a ton overlapping, as you said. I also do like a new challenge for myself, so I do like rotating things. And I would say healthcare was difficult, so it is nice to do software only.

Pure-play consumer app. This is expert mode.

Yeah,

it is.

Exactly. How are you thinking about image models, diffusion models? I saw a really cool demo for something that looked like a website, but in fact was real-time diffusion. You probably know the one I'm talking about. It looked like a sort of beige cityscape, and you could sort of zoom in, and it was all being rendered in real time. Is that something you see yourself implementing in the future, or is it a distinct, different arc of the technology tree?

So the answer is yes and no.

So I don't believe in AI outputting pixels as the primary form factor. That sounds cool. Maybe 10 or 20 years later it might be the reality, but in the near term you will want something that's more of a representation of the user interface as the layer. So I think pixels are probably not the right output medium. But diffusion models I'm obsessed with. Every time a new diffusion model comes in, we say, "Oh, is this the day we switch from LLM to diffusion-based models?" Because that sounds like the right architecture for what we are doing.

But they haven't been realistic enough. We have been trying everything, right? So essentially the way Monogram works is there's one major model, right? Because when you ask a question, there is language in the interface, so you just need a language model or some model that can do language. But then there are maybe 20 or 30 parallel things happening at the same time.

So it's a very different agent architecture.

So we have some of our own models, some fine-tuned models, but one big one. I think eventually that main model will be diffusion-based, not transformer-based. But not today. I think their tech is still not good enough to be able to handle this.

Yeah, maybe in the future. Super cool. In the App Store now. iOS exclusive? Android as well? What are you thinking?

It's iOS exclusive. We'll be working on a web application, desktop, and Android, but part of the idea here is that, because the entire interface is built by AI, we are just building clients. This will become the most cross-platform...

I imagine it has to be so fast, especially in the age of vibe coding. Porting has to be so simple now. Instagram took like five years to get to Android. I imagine you have a shorter timeline, so good luck for the...

Tell us about the round. Yeah. Tell us about the round.

So, the round was led by DST and Lux. It was a $40 million RC round.

Woo. You already

you hit the gong for the launch and you hit the gong for the round as well. It's a double gong day. Thank you so much for coming on the show. Fantastic.

Very cool. I'm super excited to play around. We'd

love to have you back on the show and chat more soon. Have a good one.

Yeah. Great to see you again.

Thank you.

Cheers.
