Interview

Elorian raises $55M seed to build multimodal visual reasoning models — where current AI performs at preschool level

Apr 9, 2026 with Andrew Dai

Key Points

  • Elorian raises $55M seed led by Striker Ventures with backing from Nvidia, Menlo Ventures, Automata, and Google Brain's Jeff Dean to build visual reasoning models that currently perform at preschool level.
  • Current AI systems can generate photorealistic images but fail at basic spatial reasoning tasks like counting objects or maintaining coherent floor plans, a gap Dai argues requires new architectures rather than more compute.
  • Elorian targets engineering and architecture workflows, where AI-assisted design tools could compress weeks of manual CAD work and building-code compliance checking into a fraction of the time.

Elorian raises $55M seed to close AI's visual reasoning gap

Andrew Dai's background makes him an unusual bet for a seed-stage company. He co-authored a paper roughly twelve years ago proposing language model pre-training — the foundational work that GPT papers cite — while at Google, where he also built Smart Reply and Smart Compose. He holds a PhD in AI from the University of Edinburgh and a computer science degree from Cambridge. Elorian is his attempt to apply that depth to a problem the major labs have largely left unresolved.

We are essentially building specialized models that include new architectures for visual reasoning, very specialized sets of data with specialized data processing, and new algorithms... You can tell these models to generate a pool table and they will make a perfectly good-looking pool table. But if you ask them to count the number of balls on the table, they will just hallucinate.

The capability gap

The core argument is that multimodal visual reasoning has been treated as secondary to language, and the gap shows. Current models perform at roughly a preschool level on visual tasks. They can generate a photorealistic pool table but will hallucinate wildly when asked to count the balls on it. They can produce an architecturally convincing floor plan with crisp lines and professional formatting that falls apart on inspection — wrong room counts, no coherent spatial logic.

Dai argues this isn't a data-volume problem that more compute will quietly fix. It requires new architectures built specifically for visual reasoning, specialized training data, and new algorithms across the full pre-training stack. Elorian is building all of it in-house.

Data strategy

On training data, Dai is direct that naturally occurring data is more valuable than synthetic. Synthetic data risks pushing models into degenerate behavior, such as repetition loops and em-dash overuse, rather than genuine understanding. Elorian is focused on data that represents the three-dimensional physical world.

Target applications

The clearest near-term use case is engineering and architecture. Mechanical engineers, hardware engineers, and architects still operate largely as they did decades ago, inside CAD tools that have seen little AI integration. Dai's example is floor plan editing: enlarging a bedroom or adding a house extension currently takes weeks of manual work plus compliance checking against building codes. A model that genuinely reasons about spatial relationships and physical constraints could compress that significantly.

The round

Elorian raised a $55M seed round led by Striker Ventures, with participation from Menlo Ventures, Automata, and Nvidia. Jeff Dean, the former Google Brain chief, is among the angels.

Nvidia's presence as an investor is notable given its direct stake in which model architectures end up consuming GPU compute at scale. Jeff Dean backing a pre-training-focused lab this early suggests conviction that the visual reasoning gap is real and structurally underinvested, not just a prompt-engineering problem waiting to be solved.