AI & Science · Architecture · June 2026

Worldview: Inside AI's Race to Build Machines That Understand Physical Reality


A language model can describe a glass tipping off a table — the words "shatter," "spill," "stoop to clean it up." A world model is being trained to know, in some structural sense, how the glass falls. That difference, researchers increasingly believe, separates a very good autocomplete from something that can actually plan, imagine, and act in the world.

June 7, 2026 By Lisa Pedrosa 11 min read Architecture · Frontier AI
SIMULATED SPACE

In a converted warehouse near Place de la République, an AI system watches a video of a tower of blocks falling — and is asked, simply, to predict what the next half-second of footage will look like. It isn't asked to caption the scene, summarize it, or write a poem about gravity. It has to know, in some internal way, what gravity does. This is the strange, quiet shift now underway at the edges of artificial intelligence research: a move away from machines that talk fluently about the world and toward machines that are starting to model it.

For three years, the dominant story of AI has been the large language model — systems trained on oceans of text that learned to write, reason in words, and hold conversations with unsettling fluency. But a growing chorus of senior researchers, including some who built the foundations of that very revolution, now argue that language alone is a dead end for building systems that can truly understand and act in physical space. Their answer is something called a "world model": an AI that doesn't just describe reality in words, but builds an internal, predictive simulation of how reality behaves — how objects move, collide, occlude each other, and respond to being touched.

The shift has a name, a growing roster of billion-dollar bets, and, as of this spring, its own exodus story. In January, Yann LeCun — one of the three researchers often called the "godfathers of deep learning" and, until recently, Meta's chief AI scientist — left the company to found a new venture, AMI Labs, devoted entirely to building world models from scratch. He didn't leave quietly. LeCun has spent much of the past two years arguing, often bluntly, that large language models are "a distraction" from the real goal of machine intelligence: systems that can plan, reason about cause and effect, and learn the way a human toddler learns — mostly by watching and poking at the physical world, not by reading the internet.

$1.03BAMI Labs' seed round — the largest in European startup history
24 fpsReal-time generation rate of DeepMind's Genie 3 world model
1M+ hrsInternet video used to train Meta's V-JEPA 2 visual model
$5BValuation World Labs is reportedly seeking after launching Marble

What a "world model" actually is

To understand why this matters, it helps to see what large language models are actually doing under the hood. An LLM is, at its core, a very sophisticated next-word predictor: given a sequence of text, it estimates which word is statistically likely to come next. It can describe a ball rolling off a table with startling eloquence — because millions of human-written sentences about balls and tables and falling are baked into its training data. What it cannot do, critics like LeCun argue, is simulate the event. It has no internal sense of the trajectory, the bounce, the way the ball might roll under a couch and stop.

A world model flips that approach. Instead of training on text, it trains on raw sensory experience — usually video — and learns to predict what happens next not in words, but in a compressed, abstract representation of the scene itself: where objects are, how they're moving, what's hidden behind what. Meta's V-JEPA 2, released in early 2026 and trained on more than a million hours of internet video, is one of the most prominent examples: rather than trying to reconstruct every pixel of a future frame (an enormously wasteful task), it predicts the underlying structure of the scene in a compact "latent space," the way a person doesn't consciously track every photon when they catch a thrown ball — they just know, intuitively, where it's going to be.

Google DeepMind has taken a more dramatic public swing at the idea with Genie 3, unveiled this year as the first real-time interactive world model: type in a description, and it generates a navigable, persistent three-dimensional environment you can walk through and manipulate at 24 frames per second, improvising new physics and geography as you go. Fei-Fei Li — the Stanford computer scientist famous for building ImageNet, the dataset that helped launch the deep-learning era in the first place — has gone commercial with hers: her startup World Labs shipped Marble this year, a world-model product that generates persistent 3D scenes that can be exported directly into game engines like Unreal and Unity, priced for studios at $20 to $95 a month.

"You can ask a language model to describe what happens when you push a glass off a table. Ask it to actually predict the next ten seconds of that video, frame by physical frame, and it falls apart. That's the gap we're trying to close — and it's not a small one."
— Researcher quoted on the world-models shift, Scientific American, 2026

Why this isn't just "the next architecture fad"

Skeptics have reason to be wary of grand architectural pronouncements — AI has seen plenty of "next big things" arrive with fanfare and quietly fold into the broader toolkit. Mixture-of-experts, retrieval augmentation, diffusion, state-space models: each was, briefly, going to change everything. The current wave of post-transformer research, including the state-space and Mamba-style architectures explored elsewhere on this site, has already shown that there's appetite — and money — for alternatives to the now-seven-year-old transformer.

But the world-models argument is different in kind, not just in scale. It isn't merely proposing a faster or cheaper way to do what LLMs already do — it's proposing that LLMs are solving the wrong problem for an entire category of tasks: anything that requires acting in, navigating, or manipulating physical space. That category happens to include almost everything robotics, autonomous vehicles, surgical assistance, warehouse automation, and embodied "physical AI" are trying to do. It's no accident that Nvidia, whose business depends on exactly those industries, has thrown its weight behind the idea too — its newly launched Cosmos 3 platform is built specifically as an open foundation model for "physical AI," combining world simulation with action generation so that robots and autonomous systems can rehearse a task in a simulated world before ever attempting it in the real one.

The practical stakes are blunt: a self-driving car, a warehouse robot, or a surgical system that can only describe what might happen next — rather than predict it with some structural fidelity — is a system that will eventually be surprised by reality in a way that matters. World models are, in part, an attempt to close that gap before it becomes a headline.
INPUT: video of a glass tipping off a table edge Language model "The glass slides toward the edge, tips, and shatters on the floor." OUTPUT → fluent description, no internal physical model World model OUTPUT → simulated trajectory, predicted physical states over time
Fig. 1 — Two systems shown the same video, producing fundamentally different kinds of output: words versus a predicted physical sequence.

The bet, the money, and the doubts

The scale of the financial wager is itself a kind of evidence that something has shifted in how the field thinks. AMI Labs' seed round of just over a billion dollars — at a $3.5 billion pre-money valuation, raised before the company had shipped a single public product — is believed to be the largest seed-stage investment in European history. World Labs is reportedly in talks to raise $500 million at a $5 billion valuation on the strength of Marble alone. DeepMind, backed by Google's balance sheet, has made Genie a flagship research line rather than a side project. None of this guarantees the bet pays off, but it does mean the next eighteen months of AI research will be unusually well-funded in exactly this direction.

Not everyone is convinced the framing is even right. Some researchers argue that the supposed divide between "language" and "world" models is overstated — that the most capable systems of the next few years will likely be hybrids, blending transformer-based language understanding with JEPA-style predictive modules, diffusion components, and state-space layers, rather than replacing one paradigm with another wholesale. That hybrid view is, in fact, where much of the immediate roadmap already sits: Meta's own V-JEPA 2 doesn't discard language models, it complements them, pairing a predictive world module with a planning system that can use language to set goals.

"LLMs can describe a rotating cube. World models have to understand what rotation actually means — in space, with mass, with momentum. That's not a refinement of language modeling. It's a different kind of intelligence problem."
— Yann LeCun, on the distinction between language and world models

What it would mean if this works

It's worth being honest about how early this all still is. Genie 3's environments are visually compelling but not yet reliably consistent over long stretches; Marble's exported scenes need cleanup before they're production-ready; V-JEPA 2 is, for now, mostly a research milestone rather than a deployed product. The gap between "can generate a plausible three-dimensional room" and "can be trusted to help a robot perform surgery, drive through a school zone, or navigate a collapsed building" is enormous, and no one credible is claiming it will close quickly.

But if even a fraction of the optimism proves justified, the implications stretch well beyond chatbots and image generators. A machine that can simulate physical outcomes before acting is, in a real sense, a machine that can imagine — that can run "what if" scenarios internally before committing to a choice in the world. That is closer to how living creatures actually navigate uncertainty: not by reciting facts about gravity, but by having, somewhere in a nervous system, a working model of what gravity does. Whether silicon can build something functionally similar — and whether doing so brings us meaningfully closer to general intelligence, or simply produces better-behaved robots — may be the central open question in AI for the rest of this decade.

For now, the field has at least settled on a more interesting question to ask than "can it write a convincing sentence?" The new question is harder, stranger, and in some ways more honest: can it tell you, correctly, what happens next?

Sources

  1. Scientific American — "The next AI revolution could start with world models" (2026)
  2. TechCrunch — "Who's behind AMI Labs, Yann LeCun's 'world model' startup" (Jan. 2026)
  3. Tech Insider — "LeCun's AMI Labs Raises $1B+ to Beat LLMs" (2026)
  4. TechBuzz AI — "Yann LeCun's AMI Labs emerges with world model AI play" (2026)
  5. Introl — "World Models Race 2026" (2026)
  6. AI2.work — "World Models in 2026: Why LeCun, Fei-Fei Li, and DeepMind Bet Billions on 3D AI" (2026)
  7. Themesis Inc. — "World Models: Five Competing Approaches" (Jan. 2026)
  8. Humai — "World Models: The Quiet AI Revolution That Could Make LLMs Look Like a Warmup Act" (2026)
  9. NVIDIA Newsroom — "NVIDIA Launches Cosmos 3, the Open Frontier Foundation Model for Physical AI" (2026)
  10. Adaline Labs — "Beyond Transformers: The 7 AI Breakthroughs Reshaping Production in 2026" (2026)
  11. Medium / Aftab — "The End of LLMs As We Know Them: Why 2026 Marks the Beginning of AI's Next Architecture Revolution" (2026)
  12. HackMD — "World Models, From Zero to Hero" (2026)
  13. Meta AI — V-JEPA 2 research release notes (2026)
  14. Google DeepMind — Genie 3 technical overview (2026)
  15. World Labs — Marble product announcement (2026)
Ko-fi Buy me a coffee
Scroll to Top