Handshake's Garrett Lord on supplying PhD talent to frontier AI labs: 3x growth in a month as Scale AI's dominance fades

Jun 18, 2025 · Full transcript · This transcript is auto-generated and may contain errors.

startup update. Kicking it off. Welcome to the stream. How are you doing? Doing super well. Thank you for having me. Uh can you kick us off with an introduction? What do you do? Hi. Yeah. So, my name is Anish. I'm one of the co-founders and the CEO of Traversal.

Uh where we are basically trying to build an AI site reliability engineer. So, when you have software systems, they break and we try to help troubleshoot why they broke and how to fix them.

Is the current status of the market entirely human site reliability engineers, or is there a suite of tools that are, you know, just not AI-enabled enough, that you can kind of piggyback on? Yeah. So I'd say the current state of observability, I call it observability 1.0, has been putting a lot of eyeballs on your data.

So essentially the great iconic companies now, like Datadog and Splunk and Grafana and Dynatrace, what they basically did is store your data and then help you visualize it through, you know, numerous dashboards. But the toil of actually figuring out, for an on-call engineer, what happened is what I call dashboard dumpster diving. That's what they still have to do, right? And there isn't that intelligence layer of actually helping you search through these petabytes of data. And I think that's what this new generation of AI agents and LLMs, and some of the research we have done in our PhDs, unlocks. Yeah.

Have you been tracking? I mean, it feels like site reliability engineering should be a beneficiary of AI, and yet it feels like we've been seeing more and more outages. Like, Cloudflare went down, Google went down, we had another outage while we were doing the show. The whole internet went down.

It feels like it's happening more and more. Is that just me being more sensitive to it or more news or I'm just online more or are these systems actually getting more fragile? Yeah, I think two things are happening.

One is, as humans we like to push the limits of everything we can do, and so we're always going to be pushing the limits of software systems. They're getting more complex, there are more microservices, and that was happening before the age of LLM-powered code agents, right? And now, with everything happening in the world of AI software engineering, with Cursor and GitHub Copilot and Windsurf and so on and so forth, I think the amount of code that's being written, there's going to be like a Cambrian explosion, right? I think most of it is going to be written by AI systems, and so the system is going to get even more complex, and less and less is going to be understood by us as engineers.

And so I think it's going to be even harder to debug things because we don't actually understand what's written and it's more complex.

And so I think that's actually going to throttle the use of AI software engineering tools in mission-critical systems, because if I'm a head of infrastructure, I'm like, I don't trust this. I'm not going to allow this to affect my core infrastructure. And so I think you need tools, observability 2.0, to help actually be in line with it.

I guess, um, I feel like finding truly great SREs is like a massive challenge. If you can deliver on the product side, how much bigger do you even think the market is? Effectively, it sounds like you're wanting to offer productized SRE talent. How much demand is out there for what you're building? Yeah.

So, first of all, like, we've been trying to hire an excellent SRE and we still haven't been able to. So there's a massive labor market discontinuity where it's just impossible to find fantastic SREs.

Um, but if I think of the role of an SRE, the way I think about it is, it's all about the health of software systems, and I think of it as like Maslow's hierarchy of needs. So, in your level one, you know, you have a heart attack, you've got to deal with it right now.

And I think that's what having a production incident feels like, and that's where SREs have to spend a lot of time, where nothing else matters. Level two is when you have this constant stream of alerts happening.

That's the equivalent of having like some sort of like chronic condition that you have to deal with all the time and you can't think like a year in advance. And only when you can deal with these things then you can start thinking about like your long-term health and like life hacking and you know optimizing your sleep.

And that's when I think you can think about the long-term health of a software system, how you would architect it over the next four, five, ten years. And so I think all of the time right now for SREs is being spent in, like, having a heart attack and a debilitating condition.

What I think they should be doing, the best SREs, is thinking about the long-term health of a system. How do I architect the system for the next five years so that it's robust? And I think that's what they want to be doing, but they're forced to be on call all the time.

And so I think that's what we want to stop them from having to do, so they can spend their time on the more meaningful, you know, aspects of their jobs. Uh, talk about the trade-off between fully agentic systems here versus more of a co-pilot dynamic.

Yeah, so I think of it as a two-by-two. Think of the y-axis as the severity of the issue. So you can have anything from alerts that are happening all the time to really complex incidents. And then on the other hand, you have how agentic is it?

So close to zero would be: you have a runbook, some sort of playbook that exists, and you have humans or engineers go execute those runbooks. On the other end, you really don't know what to do, and so you have 50 people in an incident war room trying to figure out what happened, right? So for low-value alerts, typically you have a playbook or runbook that engineers can go execute, and that's, I'd say, less agentic, because the meta workflow is always the same, which is like: look at this dashboard, then look at these logs, and correlate between them or something. The fully agentic side is when even teams of engineers have no idea how to solve it a priori, and that's where you need to come up with architectures that are truly novel, where it doesn't fall into a pre-existing runbook. So that's, I think, the type of architecture you have to build to deal with the most visible incidents that an enterprise faces. I think that's really been the focus of our company.

Last question. I was going to say, are you the first assistant professor at Columbia to raise $48 million in the first two financings of a startup?

Do you have any, is there anybody that's got you beat there? I don't know. But Columbia has been an absolutely wonderful partner. I think they've been very supportive. I think they understand the importance of AI and how, in my opinion, this is the industrial age of AI.

So if you want to be a good researcher, you have to be in the weeds learning how these systems work. You can't just be theorizing in a vacuum. And I think everyone, myself and Columbia, understands that. So they've been an incredibly supportive institution.

And, yeah, I hope that I continue my relationship with them over a long time. That's awesome. Talk about the traction to date. You guys have been in stealth. I'm sure you're going to get a flood of inbound interest today, but, uh, yeah, who are you guys building and working with so far that you can share?

Yeah, so I can talk about a few that I'm allowed to speak about, and some I can't, but they've all been some of the largest enterprises in the world, basically any company that cares about downtime.

So those turn out to be companies like infrastructure companies, payments, streaming, financial institutions because if you go down your customers feel it immediately. Those have been the customers that we've typically uh focused on or they've been attracted to us as well.

Uh, and we found that actually, as you go to the larger enterprises, two things happen. One is they actually have all the data, because if you don't have the data, we can't do anything. And so the instrumentation is mature, but at the same time it's fragmented.

You have like so many teams, so many different services. And so I think that's where an AI agent that can talk to all these different systems at the same time adds a lot of value. So we've found ourselves more and more pulled towards the enterprise.

Um, and yeah, it's deployed in a number of, you know, mission-critical environments. People are using it every day to help them troubleshoot pretty complex incidents. I think that's what we feel proud of.

I'd say it's great that this product sort of self-selects for really high-quality customers that, you know, once they onboard and they're getting value, will probably be customers for years and years and years. Um, so very, very exciting, John.

Uh, let's, I think it's time to hit the gong. Absolutely. Not going to he's not going to come on the show and we go and $48 million. Congratulations. Fantastic lineup of investors. Uh, Sequoia, Kleiner. Great, great group. Fantastic. Thank you for coming on.

Uh, come back on when you have news, and congratulations on the milestone. Have a great rest of your day. Up next, we have John Lee from Sapphire building, uh, electric mobility. Um, very excited to talk to
