Robotics & AI Architecture · Field Notes

The Rehearsal: Inside the World Model Teaching Robots to Think Before They Act


NVIDIA's new Cosmos 3 doesn't just generate video of a plausible future — it reasons about the physics of that future first. It is a bet that the fastest way to a working robot is to let it dream in a simulator, thousands of times over, before it ever touches the real world.

July 3, 2026 · Lisa Pedrosa · ~10 min read · Robotics · AI Architecture

Somewhere in an NVIDIA data center this spring, a simulated forklift backed toward a simulated pallet of simulated boxes, and the software watching it happen understood — before generating a single frame of video — that one of those boxes was about to topple. Not because someone hand-coded a rule about box-stacking. Because the system had spent its training absorbing roughly a billion images and 400 million videos of the actual physical world, humans and robots included, and had built something like an intuition for how mass and momentum behave. That intuition is the whole premise behind NVIDIA's newest release, Cosmos 3, unveiled at GTC Taipei in June 2026: an open "world foundation model" built on the idea that the fastest, safest way to make a robot competent in reality is to first let it rehearse reality, thousands of times, somewhere reality can't get hurt.

This arrives at a strange and specific moment for robotics — one where the machines themselves have stopped being a research curiosity and started showing up on loading docks. Figure AI says its BotQ factory in California has gone from building one humanoid a day to one an hour. Boston Dynamics is shipping the first commercial units of its all-electric Atlas to Hyundai and to Google DeepMind. Unitree is telling investors it expects to ship somewhere between 10,000 and 20,000 of its low-cost R1 humanoids this year alone. The hardware is no longer the bottleneck anyone worries about most. The question that keeps coming up instead, in boardrooms and research labs alike, is simpler and harder: how do you teach a robot to behave sensibly in a world it has never quite seen before, without getting it — or the person standing next to it — hurt while it learns?

The problem Cosmos 3 is actually trying to solve

Robotics has a decades-old, stubbornly persistent affliction called the sim-to-real gap. The idea is simple to state and painfully hard to fix: you train a robot's control policy inside a simulator, because simulators are cheap, fast, and infinitely repeatable, and then you deploy that policy onto a physical robot in the physical world — and it falls over. Or it grips too hard and crushes what it's holding. Or it hesitates at a doorway because the real-world lighting doesn't match anything in its training distribution. The simulated version of gravity, friction, and material deformation is never quite the same as the real thing, and control policies tend to exploit whatever quirks exist in a simulator rather than learning genuinely robust behavior. Traditional physics-engine simulators — the kind long used at companies like Boston Dynamics and in academic robotics labs — are accurate about rigid-body mechanics but expensive to build realistic scenes for, and they don't generate the kind of rich, messy, photorealistic sensory data a modern perception system needs to train on.

NVIDIA's answer, in a line the company has used repeatedly since the launch, is that a robot should be able to "think before it acts." Cosmos 3 is built to generate synthetic video of plausible future scenes and, critically, to reason about the physical plausibility of those scenes before it renders them — understanding object interactions, motion, and spatial-temporal relationships as a distinct cognitive step that happens prior to and in service of generation. That ordering is the whole architectural bet: reasoning first, imagery second, action third.

Why "world model" is not just marketing language: a world model, in the technical sense used here, is a system trained to predict what a scene will look like next given an action — essentially learning the rules of physics and cause-and-effect from data rather than from hand-written equations. Cosmos 3's distinction is doing that prediction jointly with an explicit reasoning stage, rather than asking a single network to intuit physics and pixels at once.
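As a toy illustration of that definition, and not of Cosmos 3 itself, the sketch below learns a one-parameter dynamics rule from recorded transitions instead of from a hand-written equation. The one-dimensional cart, the simulated data, and the names `true_step`, `collect_transitions`, `fit_gain`, and `predict` are all assumptions invented for this example.

```python
import random

DT = 0.1  # seconds per step (assumed for this toy)


def true_step(velocity, force, mass=2.0):
    """Ground-truth physics the learner never sees as an equation."""
    return velocity + (force / mass) * DT


def collect_transitions(n, seed=0):
    """Record (velocity, action, next_velocity) triples, like paired action data."""
    rng = random.Random(seed)
    data = []
    for _ in range(n):
        v = rng.uniform(-1.0, 1.0)
        f = rng.uniform(-5.0, 5.0)
        data.append((v, f, true_step(v, f)))
    return data


def fit_gain(transitions):
    """One-parameter least squares for k in the model v_next = v + k * action."""
    num = sum((v_next - v) * f for v, f, v_next in transitions)
    den = sum(f * f for _, f, _ in transitions)
    return num / den


def predict(velocity, force, gain):
    """The learned 'world model': predict the next state given an action."""
    return velocity + gain * force


transitions = collect_transitions(200)
gain = fit_gain(transitions)
print(round(gain, 4))                     # learned k; should be close to DT / mass
print(round(predict(0.5, 4.0, gain), 4))  # predicted next velocity for a new action
```

The learner recovers the effect of mass and time step purely from observed cause and effect. A real world model does the same kind of thing with video frames and robot actions in place of a single velocity and a single force.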

How the mixture-of-transformers architecture actually works

Under the hood, Cosmos 3 uses what NVIDIA calls a mixture-of-transformers, or MoT, design — sometimes described in early technical write-ups as a "two-tower" architecture. The first tower is a reasoning transformer: a vision-language model built in an autoregressive style, the same broad family of architecture behind most modern large language models. Its job is to take in multimodal observations — images, video, text descriptions, sensor readings — and interpret what is physically happening in a scene: which objects are touching, which are about to collide, which way something is moving and how fast.

The second tower is an expert generation transformer, built instead on a diffusion-style architecture, the approach that underlies most state-of-the-art image and video generators. Its job is to turn the reasoning tower's understanding into an actual rendered future — a video clip, a predicted robot trajectory, an action sequence — that is consistent with the physical interpretation the first tower just produced. NVIDIA's technical report describes this happening inside a single unified model: each transformer layer carries two separate sets of parameters, one processing the reasoning subsequence and one processing the generation subsequence, allowing the two towers to share context within a single forward pass rather than operating as two disconnected systems bolted together.

The practical effect is a model that can be asked to imagine "what happens if the robot arm grips the box here and lifts," and produce both a plausible video of that action unfolding and a physically grounded action trajectory the robot could actually execute — with the physics reasoning step acting as a check on the generation step, rather than leaving plausibility entirely to the generative model's imagination.
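The shared-context idea can be sketched in a few dozen lines. What follows is a toy under stated assumptions, not NVIDIA's implementation: the `TowerParams` and `mot_layer` names, the tiny dimensions, the single attention head, the causal mask, and the single matrix standing in for a feed-forward block are all simplifications invented here. The only detail taken from the description above is that each layer holds two parameter sets, one per subsequence, while both subsequences are processed in the same forward pass.

```python
import math
import random


def matvec(w, x):
    """Multiply matrix w (a list of rows) by vector x."""
    return [sum(wi * xi for wi, xi in zip(row, x)) for row in w]


def rand_matrix(rng, rows, cols):
    return [[rng.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]


def softmax(scores):
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


class TowerParams:
    """One tower's private weights inside a layer (hypothetical layout)."""

    def __init__(self, rng, dim):
        self.wq = rand_matrix(rng, dim, dim)
        self.wk = rand_matrix(rng, dim, dim)
        self.wv = rand_matrix(rng, dim, dim)
        self.wo = rand_matrix(rng, dim, dim)  # stand-in for the feed-forward block


def mot_layer(tokens, tower_ids, params):
    """Apply one toy layer. tower_ids holds 'reason' or 'generate' per token.

    Each token is projected with its own tower's weights, but attention runs
    over the whole sequence, so the towers share context in a single pass.
    """
    q = [matvec(params[t].wq, x) for x, t in zip(tokens, tower_ids)]
    k = [matvec(params[t].wk, x) for x, t in zip(tokens, tower_ids)]
    v = [matvec(params[t].wv, x) for x, t in zip(tokens, tower_ids)]
    dim = len(tokens[0])
    scale = math.sqrt(dim)
    out = []
    for i, (x, t) in enumerate(zip(tokens, tower_ids)):
        # Causal mask (an assumption of this sketch): token i sees tokens 0..i.
        scores = [sum(a * b for a, b in zip(q[i], k[j])) / scale for j in range(i + 1)]
        weights = softmax(scores)
        mixed = [sum(w * v[j][d] for j, w in enumerate(weights)) for d in range(dim)]
        update = matvec(params[t].wo, mixed)
        out.append([xi + ui for xi, ui in zip(x, update)])  # residual connection
    return out


rng = random.Random(0)
DIM = 4
params = {"reason": TowerParams(rng, DIM), "generate": TowerParams(rng, DIM)}
tokens = [[rng.uniform(-1, 1) for _ in range(DIM)] for _ in range(5)]
tower_ids = ["reason", "reason", "reason", "generate", "generate"]
out = mot_layer(tokens, tower_ids, params)
print(len(out), len(out[0]))  # 5 4
```

Because the reasoning tokens come first in this toy sequence, the generation tokens attend to them directly. Changing what the reasoning tower saw changes what the generation tower produces, which is the property the two-tower design is after.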

Cosmos 3's mixture-of-transformers architecture pairs a reasoning transformer with an expert generation transformer, enabling the system to understand object interactions, motion and spatial-temporal relationships before generating video and action trajectories.
— NVIDIA Newsroom, "NVIDIA Launches Cosmos 3, the Open Frontier Foundation Model for Physical AI," June 2026

Trained on almost everything NVIDIA could get its hands on

The scale of Cosmos 3's training run is difficult to hold in your head next to the models most people interact with day to day. NVIDIA says the model was trained on roughly 20 trillion tokens of multimodal data: nearly a billion images, about 400 million real and synthetic videos, plus ambient audio, text, and — distinctively — action data recorded from both humans and robots performing tasks. That last category matters more than it might sound. Video alone teaches a model what the world looks like. Paired action data — the actual joint angles, forces, and motion commands that produced an observed outcome — teaches a model what causes what. It's the difference between watching a thousand hours of someone cooking and actually knowing which muscle movements produced which result on the stove.

NVIDIA describes Cosmos 3 as the first fully open "omnimodel" for physical AI: a single system with native vision reasoning plus multimodal generation spanning text, image, video, ambient sound, and action, released with open weights rather than locked behind an API. That openness is itself a strategic move. Rather than compete only on a proprietary robotics stack, NVIDIA is trying to make Cosmos 3 the substrate that a much larger ecosystem of robotics and autonomous-vehicle companies build on top of — which is also the logic behind the "Cosmos Coalition" the company launched alongside the model.

20T: training tokens across text, image, video, audio, and action data
~1B: images in the training corpus
~400M: real and synthetic videos used in training
3: model variants (Super, Nano, and the upcoming Edge)

Three sizes, three jobs

NVIDIA didn't release Cosmos 3 as one model but as a family tuned for different points in a robot's development pipeline. Cosmos 3 Super, reportedly built at roughly 64 billion parameters, is positioned as the option with the highest physical accuracy: the version you'd use offline, on data-center-class GPUs, to generate the enormous synthetic datasets used for post-training robotics and autonomous-vehicle policy models. Cosmos 3 Nano, a smaller model in the neighborhood of 16 billion parameters, trades some accuracy for speed. It is designed to perform video and action reasoning in fractions of a second, fast enough to sit closer to a robot's actual decision loop, running on workstation-class hardware rather than a full data center. Cosmos 3 Edge, the smallest of the three and still to come, is intended for real-time inference directly on edge hardware: the onboard compute of the robot or vehicle itself, where every millisecond and every watt is contested.

That three-tier structure mirrors how the sim-to-real problem is actually tackled in practice today: enormous amounts of offline synthetic data generation to build a robust policy, followed by a much leaner, faster model doing the real-time reasoning once that policy is deployed on physical hardware. Among open models, NVIDIA says Cosmos 3 currently ranks first on several third-party evaluations of world-generation accuracy, including Physics-IQ and R-Bench, benchmarks specifically designed to test whether a generated video obeys the physical rules of the world rather than merely looking convincing.

[Figure: The Cosmos 3 pipeline, think then act. Observations (video, images, text, sensor and action data) → Reasoning Tower (autoregressive VLM; interprets physics, motion, object contact) → Generation Tower (diffusion model; renders video and action trajectories) → Robot Policy (trained on rehearsed futures, deployed to real hardware). Why the order matters: reasoning happens before rendering, not after, so implausible physics gets caught before it's baked into training data the robot's policy would otherwise learn from.]
Fig. 1 — How Cosmos 3's two-tower mixture-of-transformers pipeline moves from raw observation to a trained robot policy.

Why a coalition, not just a product launch

Alongside Cosmos 3, NVIDIA announced the Cosmos Coalition, a group of partners committing to build on the open model together: Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI, among others named in NVIDIA's own announcements. The pitch is straightforwardly ecosystem-building — the same playbook NVIDIA has run with CUDA and, more recently, with its Isaac robotics platform. If world models become the shared substrate that many different robotics and vision-AI companies train on top of, rather than each lab building its own from scratch, NVIDIA's hardware and tooling become the default layer underneath all of them. Reported early users span a wide range of applications: Agile Robots and Skild AI using it for humanoid and manipulation research, automakers like Li Auto reportedly applying it to autonomous-vehicle development, and vision-AI companies applying the same underlying model to industrial monitoring and smart-space applications — a signal that NVIDIA sees Cosmos 3 as infrastructure for physical AI broadly, not a single-purpose robotics tool.

The stakes: robots are leaving the demo stage

None of this would matter as much if humanoid robots were still confined to trade-show stages. They aren't. Figure AI says its BotQ facility has pushed production from one Figure 03 unit a day in January to roughly one per hour by late April — a 24-fold increase in under four months — with a stated annual target near 12,000 units. Boston Dynamics unveiled a fully electric, production-ready version of Atlas at CES 2026 and has already committed its 2026 output to Hyundai's robotics manufacturing arm and to Google DeepMind, which is supplying the Gemini Robotics models that will serve as part of Atlas's reasoning layer. Unitree, meanwhile, is chasing volume rather than sophistication with its sub-$6,000 R1 humanoid, with its CEO suggesting the company alone could ship 10,000 to 20,000 units in 2026 — up sharply from roughly 5,500 the year before.
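The arithmetic behind those production figures is easy to check. The short calculation below uses only the numbers quoted above and assumes round-the-clock operation with no downtime, which is an assumption of this sketch and not something the companies have stated here. The variable names are illustrative.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365

# Figure AI: one unit per day (January) versus one unit per hour (late April).
# Assumption: the line runs around the clock, so one per hour is 24 per day.
units_per_day_january = 1
units_per_day_april = 1 * HOURS_PER_DAY
fold_increase = units_per_day_april / units_per_day_january

# Annual output if the one-per-hour rate were sustained all year.
annual_at_one_per_hour = HOURS_PER_DAY * DAYS_PER_YEAR
stated_annual_target = 12_000
rate_needed_for_target = stated_annual_target / annual_at_one_per_hour  # units/hour

# Unitree: 2026 shipment range relative to roughly 5,500 units the year before.
unitree_previous_year = 5_500
unitree_growth_low = 10_000 / unitree_previous_year
unitree_growth_high = 20_000 / unitree_previous_year

print(fold_increase)                     # 24.0
print(annual_at_one_per_hour)            # 8760
print(round(rate_needed_for_target, 2))  # 1.37
print(round(unitree_growth_low, 2), round(unitree_growth_high, 2))  # 1.82 3.64
```

On those assumptions, a sustained one-per-hour pace works out to about 8,760 units a year, so a target near 12,000 implies a rate closer to 1.4 units per hour. Unitree's projection amounts to roughly a two- to four-fold jump over the prior year.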

That deployment wave is precisely what makes the sim-to-real problem urgent rather than academic. A research-lab robot that occasionally misjudges a grip is a nuisance. A commercially deployed humanoid that occasionally misjudges a grip while working alongside people in an Amazon warehouse or a Toyota manufacturing plant is a safety incident. NVIDIA's other June 2026 announcement, a full-stack safety architecture called Halos for Robotics, is best understood as the other half of the same story: where Cosmos 3 is about training a robot to behave well before it ever acts in the real world, Halos is about catching and constraining the moments when it doesn't. Agility Robotics has signed on as the first commercial adopter, integrating Halos into the safety systems of its Digit humanoid, which already works in logistics and warehouse settings for customers including Amazon, GXO, and Toyota Motor Manufacturing Canada.

The big bang of physical AI is just around the corner, thanks to breakthroughs in multimodal reasoning, language, vision, and world models.
— Jensen Huang, Founder and CEO, NVIDIA, GTC Taipei, June 2026

What this doesn't solve

It's worth being honest about the limits here, because the rhetoric around world models has a tendency to outrun the engineering. Cosmos 3 does not eliminate the sim-to-real gap; it narrows it by making the synthetic side of that gap dramatically richer, more physically grounded, and cheaper to generate at scale. A world model trained on 400 million videos still cannot have seen every real-world edge case a deployed robot will eventually encounter — the spilled liquid on an unexpected surface, the human who moves unpredictably, the object whose weight distribution doesn't match its appearance. What Cosmos 3 changes is the odds: instead of a robotics team hand-building a few hundred simulated scenarios and hoping they generalize, they can now generate an almost unbounded number of physically plausible variations and use them to harden a policy before it ever meets a real forklift, a real spill, or a real person. That is a meaningfully different starting position than robotics has had before — not a solved problem, but a substantially better rehearsal space.
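One way to picture "generating physically plausible variations" is randomized scenario sampling, often called domain randomization, with a physical outcome worked out for each variation before anything would be rendered. The sketch below is a toy under stated assumptions: the `Scenario` fields, the parameter ranges, and the rigid-box tipping rule are invented for illustration and are unrelated to Cosmos 3's actual interfaces.

```python
import random
from collections import Counter
from dataclasses import dataclass

G = 9.81  # gravitational acceleration, m/s^2


@dataclass(frozen=True)
class Scenario:
    box_width_m: float     # width of the box's base along the direction of travel
    com_height_m: float    # height of the box's center of mass above the pallet
    decel_m_s2: float      # forklift braking deceleration
    friction_coeff: float  # friction coefficient between box and pallet


def sample_scenario(rng):
    """Draw one variation; the ranges are arbitrary choices for this toy."""
    return Scenario(
        box_width_m=rng.uniform(0.2, 1.0),
        com_height_m=rng.uniform(0.1, 1.0),
        decel_m_s2=rng.uniform(0.5, 6.0),
        friction_coeff=rng.uniform(0.2, 0.9),
    )


def predicted_outcome(s):
    """Textbook rigid-box rule of thumb, a deliberate simplification."""
    tip_threshold = (s.box_width_m / 2) / s.com_height_m
    demanded = s.decel_m_s2 / G                    # deceleration in units of g
    transmitted = min(demanded, s.friction_coeff)  # friction caps the load on the box
    if transmitted > tip_threshold:
        return "tips"
    if demanded > s.friction_coeff:
        return "slides"
    return "stable"


rng = random.Random(42)
scenarios = [sample_scenario(rng) for _ in range(1000)]
counts = Counter(predicted_outcome(s) for s in scenarios)
print(counts)
```

A hand-built test suite might cover a few dozen such cases. Sampling makes it cheap to cover thousands, and labelling each one with its physical outcome first shows whether the rare cases, such as tall boxes under hard braking, are represented at all.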

What happens next

Expect the next twelve months to be less about a single splashy demo and more about whether the Cosmos Coalition's members can turn an open world model into measurably better robot behavior in the field — fewer dropped objects, fewer safety stops, faster generalization to new tasks without months of re-training. Watch for Cosmos 3 Edge's release, which will be the real test of whether this reasoning-before-acting architecture can run on the actual compute budget of a commercial robot rather than a data center. Watch, too, for how Halos for Robotics and Cosmos 3 end up talking to each other in practice, since a safety layer is only as good as the behavior it's constraining, and a well-rehearsed robot is only trustworthy if something is still watching for the rehearsal's blind spots. The framing NVIDIA has settled on — that physical AI needs to think before it acts — is a genuinely useful way to describe what's changed. The harder, more interesting question is how much of the real world a model can ever truly rehearse before it has to just go outside and try.

Sources

  1. NVIDIA Newsroom. "NVIDIA Launches Cosmos 3, the Open Frontier Foundation Model for Physical AI." June 2026. nvidianews.nvidia.com
  2. NVIDIA Blog. "How Cosmos 3 Helps Physical AI Think Before It Acts." blogs.nvidia.com
  3. NVIDIA Technical Blog. "Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3." developer.nvidia.com
  4. NVIDIA Research. "Cosmos 3: Omnimodal World Models for Physical AI" (technical report). research.nvidia.com/labs/cosmos-lab
  5. Hugging Face Blog. "Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action." huggingface.co
  6. MarkTechPost. "NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation." June 2026. marktechpost.com
  7. the-decoder. "Nvidia bets big on physical AI at GTC Taipei with a new world model, driving brain, and open humanoid robot." the-decoder.com
  8. NVIDIA Newsroom. "NVIDIA and Global Robotics Leaders Take Physical AI to the Real World." nvidianews.nvidia.com
  9. NVIDIA Newsroom. "NVIDIA Announces Halos for Robotics, the Industry's First Full-Stack Safety System for Physical AI." nvidianews.nvidia.com
  10. The Robot Report. "NVIDIA releases Halos, a full-stack safety system for robotics." therobotreport.com
  11. Figure AI. "Ramping Figure 03 Production." figure.ai/news
  12. Interesting Engineering. "Figure claims new BotQ facility can make one humanoid robot per hour." interestingengineering.com
  13. The Robot Report. "Boston Dynamics, Google reunite on next-gen Atlas humanoid." therobotreport.com
  14. eWeek. "China's Unitree Aims to Ship 20,000 Humanoid Robots in 2026." eweek.com
  15. KraneShares. "Humanoid Robotics in 2026: The Race From Pilot to Platform." kraneshares.com
