AI & Science · Architecture · June 2026
A language model can describe a glass tipping off a table — the words "shatter," "spill," "stoop to clean it up." A world model is being trained to know, in some structural sense, how the glass falls. That difference, researchers increasingly believe, separates a very good autocomplete from something that can actually plan, imagine, and act in the world.
In a converted warehouse near Place de la République, an AI system watches a video of a tower of blocks falling — and is asked, simply, to predict what the next half-second of footage will look like. It isn't asked to caption the scene, summarize it, or write a poem about gravity. It has to know, in some internal way, what gravity does. This is the strange, quiet shift now underway at the edges of artificial intelligence research: a move away from machines that talk fluently about the world and toward machines that are starting to model it.
For three years, the dominant story of AI has been the large language model — systems trained on oceans of text that learned to write, reason in words, and hold conversations with unsettling fluency. But a growing chorus of senior researchers, including some who built the foundations of that very revolution, now argue that language alone is a dead end for building systems that can truly understand and act in physical space. Their answer is something called a "world model": an AI that doesn't just describe reality in words, but builds an internal, predictive simulation of how reality behaves — how objects move, collide, occlude each other, and respond to being touched.
The shift has a name, a growing roster of billion-dollar bets, and, as of this spring, its own exodus story. In January, Yann LeCun — one of the three researchers often called the "godfathers of deep learning" and, until recently, Meta's chief AI scientist — left the company to found a new venture, AMI Labs, devoted entirely to building world models from scratch. He didn't leave quietly. LeCun has spent much of the past two years arguing, often bluntly, that large language models are "a distraction" from the real goal of machine intelligence: systems that can plan, reason about cause and effect, and learn the way a human toddler learns — mostly by watching and poking at the physical world, not by reading the internet.
To understand why this matters, it helps to see what large language models are actually doing under the hood. An LLM is, at its core, a very sophisticated next-word predictor: given a sequence of text, it estimates which word is statistically likely to come next. It can describe a ball rolling off a table with startling eloquence — because millions of human-written sentences about balls and tables and falling are baked into its training data. What it cannot do, critics like LeCun argue, is simulate the event. It has no internal sense of the trajectory, the bounce, the way the ball might roll under a couch and stop.
A world model flips that approach. Instead of training on text, it trains on raw sensory experience — usually video — and learns to predict what happens next not in words, but in a compressed, abstract representation of the scene itself: where objects are, how they're moving, what's hidden behind what. Meta's V-JEPA 2, released in early 2026 and trained on more than a million hours of internet video, is one of the most prominent examples: rather than trying to reconstruct every pixel of a future frame (an enormously wasteful task), it predicts the underlying structure of the scene in a compact "latent space," the way a person doesn't consciously track every photon when they catch a thrown ball — they just know, intuitively, where it's going to be.
Google DeepMind has taken a more dramatic public swing at the idea with Genie 3, unveiled this year as the first real-time interactive world model: type in a description, and it generates a navigable, persistent three-dimensional environment you can walk through and manipulate at 24 frames per second, improvising new physics and geography as you go. Fei-Fei Li — the Stanford computer scientist famous for building ImageNet, the dataset that helped launch the deep-learning era in the first place — has gone commercial with hers: her startup World Labs shipped Marble this year, a world-model product that generates persistent 3D scenes that can be exported directly into game engines like Unreal and Unity, priced for studios at $20 to $95 a month.
"You can ask a language model to describe what happens when you push a glass off a table. Ask it to actually predict the next ten seconds of that video, frame by physical frame, and it falls apart. That's the gap we're trying to close — and it's not a small one."— Researcher quoted on the world-models shift, Scientific American, 2026
Skeptics have reason to be wary of grand architectural pronouncements — AI has seen plenty of "next big things" arrive with fanfare and quietly fold into the broader toolkit. Mixture-of-experts, retrieval augmentation, diffusion, state-space models: each was, briefly, going to change everything. The current wave of post-transformer research, including the state-space and Mamba-style architectures explored elsewhere on this site, has already shown that there's appetite — and money — for alternatives to the now-seven-year-old transformer.
But the world-models argument is different in kind, not just in scale. It isn't merely proposing a faster or cheaper way to do what LLMs already do — it's proposing that LLMs are solving the wrong problem for an entire category of tasks: anything that requires acting in, navigating, or manipulating physical space. That category happens to include almost everything robotics, autonomous vehicles, surgical assistance, warehouse automation, and embodied "physical AI" are trying to do. It's no accident that Nvidia, whose business depends on exactly those industries, has thrown its weight behind the idea too — its newly launched Cosmos 3 platform is built specifically as an open foundation model for "physical AI," combining world simulation with action generation so that robots and autonomous systems can rehearse a task in a simulated world before ever attempting it in the real one.
The scale of the financial wager is itself a kind of evidence that something has shifted in how the field thinks. AMI Labs' seed round of just over a billion dollars — at a $3.5 billion pre-money valuation, raised before the company had shipped a single public product — is believed to be the largest seed-stage investment in European history. World Labs is reportedly in talks to raise $500 million at a $5 billion valuation on the strength of Marble alone. DeepMind, backed by Google's balance sheet, has made Genie a flagship research line rather than a side project. None of this guarantees the bet pays off, but it does mean the next eighteen months of AI research will be unusually well-funded in exactly this direction.
Not everyone is convinced the framing is even right. Some researchers argue that the supposed divide between "language" and "world" models is overstated — that the most capable systems of the next few years will likely be hybrids, blending transformer-based language understanding with JEPA-style predictive modules, diffusion components, and state-space layers, rather than replacing one paradigm with another wholesale. That hybrid view is, in fact, where much of the immediate roadmap already sits: Meta's own V-JEPA 2 doesn't discard language models, it complements them, pairing a predictive world module with a planning system that can use language to set goals.
"LLMs can describe a rotating cube. World models have to understand what rotation actually means — in space, with mass, with momentum. That's not a refinement of language modeling. It's a different kind of intelligence problem."— Yann LeCun, on the distinction between language and world models
It's worth being honest about how early this all still is. Genie 3's environments are visually compelling but not yet reliably consistent over long stretches; Marble's exported scenes need cleanup before they're production-ready; V-JEPA 2 is, for now, mostly a research milestone rather than a deployed product. The gap between "can generate a plausible three-dimensional room" and "can be trusted to help a robot perform surgery, drive through a school zone, or navigate a collapsed building" is enormous, and no one credible is claiming it will close quickly.
But if even a fraction of the optimism proves justified, the implications stretch well beyond chatbots and image generators. A machine that can simulate physical outcomes before acting is, in a real sense, a machine that can imagine — that can run "what if" scenarios internally before committing to a choice in the world. That is closer to how living creatures actually navigate uncertainty: not by reciting facts about gravity, but by having, somewhere in a nervous system, a working model of what gravity does. Whether silicon can build something functionally similar — and whether doing so brings us meaningfully closer to general intelligence, or simply produces better-behaved robots — may be the central open question in AI for the rest of this decade.
For now, the field has at least settled on a more interesting question to ask than "can it write a convincing sentence?" The new question is harder, stranger, and in some ways more honest: can it tell you, correctly, what happens next?

Mamba, state-space models, and the quiet hunt for what comes after the architecture that built the LLM era.

A history of digital minds — and how the dream of a thinking machine has shifted shape again and again.

What does it mean for a machine to reflect on itself — and how would we ever know if it did?

How an autonomous AI system named Robin designed and validated a new drug candidate without a human in the loop.

Humanoid robots are clocking in. What their first year on real factory floors actually looked like.

Inside the strange, growing field of biological computing — machines built from living cells.
Buy me a coffee