What is NVIDIA Cosmos 3 and what does it do?

NVIDIA Cosmos 3 is an open world foundation model for physical AI, unveiled at GTC Taipei in June 2026. It reasons about the physical plausibility of future scenes before rendering them as synthetic video, which lets robots rehearse real-world tasks safely in simulation. It was trained on roughly 20 trillion tokens of multimodal data, including images, video, audio, text, and action data recorded from both humans and robots.

How does the Cosmos 3 mixture-of-transformers architecture work?

Cosmos 3 uses a two-tower mixture-of-transformers design. The first tower is an autoregressive vision-language reasoning transformer that interprets physical scene dynamics such as object collisions and motion. The second tower is a diffusion-style generation transformer that renders video and action trajectories consistent with the first tower's physical interpretation. The two towers share context within a single unified forward pass rather than running as separate systems.

What is the sim-to-real gap in robotics and how does Cosmos 3 address it?

The sim-to-real gap is the persistent problem where robot control policies trained in simulators fail to transfer reliably to the physical world because simulated physics never perfectly matches reality. Cosmos 3 narrows this gap by generating an almost unbounded number of physically grounded, photorealistic synthetic training scenarios, giving robot policies a much richer rehearsal space before real-world deployment. It does not eliminate the gap entirely but substantially improves the starting position for robotics teams.

What are the three versions of Cosmos 3 and what are they used for?

Cosmos 3 comes in three variants: Super, Nano, and the forthcoming Edge. Cosmos 3 Super, at roughly 64 billion parameters, is designed for offline datacenter-class synthetic data generation. Cosmos 3 Nano, around 16 billion parameters, is fast enough for near-real-time reasoning on workstation hardware. Cosmos 3 Edge, still to be released, targets real-time inference on the onboard compute of robots and vehicles themselves.

Which companies are using or partnering with NVIDIA Cosmos 3?

NVIDIA launched the Cosmos Coalition with partners including Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI. Reported early users also include automaker Li Auto, for autonomous-vehicle development. Separately, Agility Robotics is the first commercial adopter of Halos for Robotics, NVIDIA's companion safety system, which it is integrating into its Digit humanoid; Digit already works for customers including Amazon, GXO, and Toyota Motor Manufacturing Canada.

Robotics & AI Architecture · Field Notes

The Rehearsal: Inside the World Model Teaching Robots to Think Before They Act

NVIDIA's new Cosmos 3 doesn't just generate video of a plausible future — it reasons about the physics of that future first. It is a bet that the fastest way to a working robot is to let it dream in a simulator, thousands of times over, before it ever touches the real world.

July 3, 2026 · Lisa Pedrosa · ~10 min read · Robotics · AI Architecture

Somewhere in an NVIDIA data center this spring, a simulated forklift backed toward a simulated pallet of simulated boxes, and the software watching it happen understood — before generating a single frame of video — that one of those boxes was about to topple. Not because someone hand-coded a rule about box-stacking. Because the system had spent its training absorbing roughly a billion images and 400 million videos of the actual physical world, humans and robots included, and had built something like an intuition for how mass and momentum behave. That intuition is the whole premise behind NVIDIA's newest release, Cosmos 3, unveiled at GTC Taipei in June 2026: an open "world foundation model" built on the idea that the fastest, safest way to make a robot competent in reality is to first let it rehearse reality, thousands of times, somewhere reality can't get hurt.

This arrives at a strange and specific moment for robotics — one where the machines themselves have stopped being a research curiosity and started showing up on loading docks. Figure AI says its BotQ factory in California has gone from building one humanoid a day to one an hour. Boston Dynamics is shipping the first commercial units of its all-electric Atlas to Hyundai and to Google DeepMind. Unitree is telling investors it expects to ship somewhere between 10,000 and 20,000 of its low-cost R1 humanoids this year alone. The hardware is no longer the bottleneck anyone worries about most. The question that keeps coming up instead, in boardrooms and research labs alike, is simpler and harder: how do you teach a robot to behave sensibly in a world it has never quite seen before, without getting it — or the person standing next to it — hurt while it learns?

The problem Cosmos 3 is actually trying to solve

Robotics has a decades-old, stubbornly persistent affliction called the sim-to-real gap. The idea is simple to state and painfully hard to fix: you train a robot's control policy inside a simulator, because simulators are cheap, fast, and infinitely repeatable, and then you deploy that policy onto a physical robot in the physical world — and it falls over. Or it grips too hard and crushes what it's holding. Or it hesitates at a doorway because the real-world lighting doesn't match anything in its training distribution. The simulated version of gravity, friction, and material deformation is never quite the same as the real thing, and control policies tend to exploit whatever quirks exist in a simulator rather than learning genuinely robust behavior. Traditional physics-engine simulators — the kind long used at companies like Boston Dynamics and in academic robotics labs — are accurate about rigid-body mechanics but expensive to build realistic scenes for, and they don't generate the kind of rich, messy, photorealistic sensory data a modern perception system needs to train on.

NVIDIA's answer, in a line the company has used repeatedly since the launch, is that a robot should be able to "think before it acts." Cosmos 3 is built to generate synthetic video of plausible future scenes and, critically, to reason about the physical plausibility of those scenes before it renders them — understanding object interactions, motion, and spatial-temporal relationships as a distinct cognitive step that happens prior to and in service of generation. That ordering is the whole architectural bet: reasoning first, imagery second, action third.

Why "world model" is not just marketing language: a world model, in the technical sense used here, is a system trained to predict what a scene will look like next given an action — essentially learning the rules of physics and cause-and-effect from data rather than from hand-written equations. Cosmos 3's distinction is doing that prediction jointly with an explicit reasoning stage, rather than asking a single network to intuit physics and pixels at once.
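To make that definition concrete, here is a toy sketch of the idea of learning dynamics from data and then "imagining" forward. It is not how Cosmos 3 works internally: the one-dimensional cart, the 2 kg mass, the time step, and every name (`State`, `true_step`, `fit_gain`, `rollout`) are assumptions invented for illustration. The model sees only observed transitions, fits a single unknown coefficient by least squares, and then predicts the outcome of an action sequence it never observed.

```python
from dataclasses import dataclass

DT = 0.1  # assumed time step in seconds


@dataclass(frozen=True)
class State:
    position: float
    velocity: float


def true_step(s: State, force: float) -> State:
    """Stand-in for reality: a 2 kg cart, so acceleration = force / 2."""
    v = s.velocity + (force / 2.0) * DT
    return State(s.position + v * DT, v)


def fit_gain(transitions):
    """Least-squares fit of k in the model: delta_velocity = k * force * DT."""
    num = sum((f * DT) * (nxt.velocity - cur.velocity) for cur, f, nxt in transitions)
    den = sum((f * DT) ** 2 for _, f, _ in transitions)
    return num / den


def make_world_model(k: float):
    """Return a predictor that maps (state, action) to the next state."""
    def predict_next(s: State, force: float) -> State:
        v = s.velocity + k * force * DT
        return State(s.position + v * DT, v)
    return predict_next


def rollout(model, start: State, actions):
    """Imagine a trajectory by feeding each prediction back into the model."""
    states = [start]
    for a in actions:
        states.append(model(states[-1], a))
    return states


# Collect a handful of observed (state, action, next_state) transitions.
s = State(0.0, 0.0)
data = []
for force in [1.0, -0.5, 2.0, 0.0, 1.5]:
    nxt = true_step(s, force)
    data.append((s, force, nxt))
    s = nxt

k = fit_gain(data)                      # learned from data, never hand-coded
model = make_world_model(k)
imagined = rollout(model, State(0.0, 0.0), [1.0, 1.0, 1.0])
print(f"learned gain k = {k:.3f}, imagined final position = {imagined[-1].position:.3f}")
```

The learned gain recovers the hidden 1/mass term, so the imagined trajectory matches what the "real" cart would do. A production world model replaces the single coefficient with a large network and the scalar state with video frames, but the loop of predict, feed back, predict again is the same shape.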

How the mixture-of-transformers architecture actually works

Under the hood, Cosmos 3 uses what NVIDIA calls a mixture-of-transformers, or MoT, design — sometimes described in early technical write-ups as a "two-tower" architecture. The first tower is a reasoning transformer: a vision-language model built in an autoregressive style, the same broad family of architecture behind most modern large language models. Its job is to take in multimodal observations — images, video, text descriptions, sensor readings — and interpret what is physically happening in a scene: which objects are touching, which are about to collide, which way something is moving and how fast.

The second tower is an expert generation transformer, built instead on a diffusion-style architecture, the approach that underlies most state-of-the-art image and video generators. Its job is to turn the reasoning tower's understanding into an actual rendered future — a video clip, a predicted robot trajectory, an action sequence — that is consistent with the physical interpretation the first tower just produced. NVIDIA's technical report describes this happening inside a single unified model: each transformer layer carries two separate sets of parameters, one processing the reasoning subsequence and one processing the generation subsequence, allowing the two towers to share context within a single forward pass rather than operating as two disconnected systems bolted together.

The practical effect is a model that can be asked to imagine "what happens if the robot arm grips the box here and lifts," and produce both a plausible video of that action unfolding and a physically grounded action trajectory the robot could actually execute — with the physics reasoning step acting as a check on the generation step, rather than leaving plausibility entirely to the generative model's imagination.
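The per-layer parameter split described above can be sketched in a few lines of NumPy. This is a minimal illustration of the general mixture-of-transformers idea, not NVIDIA's implementation: it assumes a single attention head, no normalization, no causal mask, and tiny random weights, and the names `init_tower`, `route`, and `mot_layer` are invented here. Each token is multiplied by the weights of the tower that owns it, while attention runs once over the whole sequence, which is what lets generation tokens see reasoning tokens in the same forward pass.

```python
import numpy as np

REASONING, GENERATION = 0, 1  # tower indices (illustrative names)


def init_tower(rng, d_model, d_ff):
    """One tower's private weights for a single layer."""
    shapes = {
        "Wq": (d_model, d_model), "Wk": (d_model, d_model),
        "Wv": (d_model, d_model), "Wo": (d_model, d_model),
        "W1": (d_model, d_ff), "W2": (d_ff, d_model),
    }
    return {name: rng.normal(0.0, 0.02, size=shape) for name, shape in shapes.items()}


def route(x, is_gen, towers, name):
    """Multiply each token by the named matrix of the tower that owns it."""
    w_reason, w_gen = towers[REASONING][name], towers[GENERATION][name]
    out = np.empty((x.shape[0], w_reason.shape[1]))
    out[~is_gen] = x[~is_gen] @ w_reason   # reasoning subsequence
    out[is_gen] = x[is_gen] @ w_gen        # generation subsequence
    return out


def softmax(z):
    z = z - z.max(axis=-1, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=-1, keepdims=True)


def mot_layer(x, is_gen, towers):
    """One layer: tower-specific projections, shared attention, tower-specific MLP."""
    q = route(x, is_gen, towers, "Wq")
    k = route(x, is_gen, towers, "Wk")
    v = route(x, is_gen, towers, "Wv")
    # Attention is computed jointly over all tokens, so the towers share context.
    weights = softmax(q @ k.T / np.sqrt(x.shape[1]))
    x = x + route(weights @ v, is_gen, towers, "Wo")
    hidden = np.maximum(route(x, is_gen, towers, "W1"), 0.0)  # ReLU
    return x + route(hidden, is_gen, towers, "W2")


rng = np.random.default_rng(0)
d_model, d_ff = 8, 16
towers = [init_tower(rng, d_model, d_ff), init_tower(rng, d_model, d_ff)]
tokens = rng.normal(size=(6, d_model))
is_gen = np.array([False, False, False, False, True, True])
out = mot_layer(tokens, is_gen, towers)

# Changing a reasoning token changes the generation tokens' outputs.
perturbed = tokens.copy()
perturbed[0] += 5.0
out_perturbed = mot_layer(perturbed, is_gen, towers)
shares_context = not np.array_equal(out[is_gen], out_perturbed[is_gen])
print(out.shape, shares_context)
```

The final check is the point of the design: perturbing a token in the reasoning subsequence alters what the generation subsequence produces, even though the two never share a weight matrix.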

Cosmos 3's mixture-of-transformers architecture pairs a reasoning transformer with an expert generation transformer, enabling the system to understand object interactions, motion and spatial-temporal relationships before generating video and action trajectories.

— NVIDIA Newsroom, "NVIDIA Launches Cosmos 3, the Open Frontier Foundation Model for Physical AI," June 2026

Trained on almost everything NVIDIA could get its hands on

The scale of Cosmos 3's training run is difficult to hold in your head next to the models most people interact with day to day. NVIDIA says the model was trained on roughly 20 trillion tokens of multimodal data: nearly a billion images, about 400 million real and synthetic videos, plus ambient audio, text, and — distinctively — action data recorded from both humans and robots performing tasks. That last category matters more than it might sound. Video alone teaches a model what the world looks like. Paired action data — the actual joint angles, forces, and motion commands that produced an observed outcome — teaches a model what causes what. It's the difference between watching a thousand hours of someone cooking and actually knowing which muscle movements produced which result on the stove.

NVIDIA describes Cosmos 3 as the first fully open "omnimodel" for physical AI: a single system with native vision reasoning plus multimodal generation spanning text, image, video, ambient sound, and action, released with open weights rather than locked behind an API. That openness is itself a strategic move. Rather than compete only on a proprietary robotics stack, NVIDIA is trying to make Cosmos 3 the substrate that a much larger ecosystem of robotics and autonomous-vehicle companies build on top of — which is also the logic behind the "Cosmos Coalition" the company launched alongside the model.

20T: training tokens across text, image, video, audio, and action data
~1B: images in the training corpus
~400M: real and synthetic videos used in training

Model variants: Super, Nano, and the upcoming Edge

Three sizes, three jobs

NVIDIA didn't release Cosmos 3 as one model but as a family tuned for different points in a robot's development pipeline. Cosmos 3 Super, reportedly built at roughly 64 billion parameters, is positioned as the highest-physics-accuracy option — the version you'd use offline, on datacenter-class GPUs, to generate the enormous synthetic datasets used for post-training robotics and autonomous-vehicle policy models. Cosmos 3 Nano, a smaller model in the neighborhood of 16 billion parameters, trades some accuracy for speed, designed to perform video and action reasoning in fractions of a second — fast enough to sit closer to a robot's actual decision loop, running on workstation-class hardware rather than a full datacenter. Cosmos 3 Edge, the smallest of the three and still to come, is intended for real-time inference directly on edge hardware — the onboard compute of the robot or vehicle itself, where every millisecond and every watt is contested.

That three-tier structure mirrors how the sim-to-real problem is actually solved in practice today: enormous amounts of offline synthetic data generation to build a robust policy, followed by a much leaner, faster model doing the real-time reasoning once that policy is deployed on physical hardware. Among open models, NVIDIA says Cosmos 3 currently ranks first on several third-party evaluations of world-generation accuracy, including Physics-IQ and R-Bench — benchmarks specifically designed to test whether a generated video obeys the physical rules of the world rather than merely looking convincing.
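A back-of-the-envelope calculation helps explain why Super and Nano land on different hardware tiers. The sketch below uses the approximate parameter counts reported above and assumes common numeric precisions; the precisions Cosmos 3 actually ships in are not stated here, and the figures cover weights only, ignoring activations and any caches. Edge is left out because its size has not been announced.

```python
# Approximate parameter counts as reported in this article.
PARAMS = {"Cosmos 3 Super": 64e9, "Cosmos 3 Nano": 16e9}

# Assumed storage precisions (bytes per parameter); not confirmed for Cosmos 3.
BYTES_PER_PARAM = {"bf16": 2.0, "int8": 1.0, "int4": 0.5}


def weight_gigabytes(num_params: float, bytes_per_param: float) -> float:
    """Memory needed to hold the weights alone, in decimal gigabytes."""
    return num_params * bytes_per_param / 1e9


for model, n in PARAMS.items():
    row = ", ".join(
        f"{prec}: {weight_gigabytes(n, b):.0f} GB" for prec, b in BYTES_PER_PARAM.items()
    )
    print(f"{model:>15} -> {row}")
```

Under these assumptions the Super weights alone need about 128 GB at 16-bit precision, which is multi-GPU territory, while Nano needs about 32 GB, within reach of a single high-end workstation card.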

Fig. 1 — How Cosmos 3's two-tower mixture-of-transformers pipeline moves from raw observation to a trained robot policy.

Why a coalition, not just a product launch

Alongside Cosmos 3, NVIDIA announced the Cosmos Coalition, a group of partners committing to build on the open model together: Agile Robots, Black Forest Labs, Generalist, LTX, Runway, and Skild AI, among others named in NVIDIA's own announcements. The pitch is straightforwardly ecosystem-building — the same playbook NVIDIA has run with CUDA and, more recently, with its Isaac robotics platform. If world models become the shared substrate that many different robotics and vision-AI companies train on top of, rather than each lab building its own from scratch, NVIDIA's hardware and tooling become the default layer underneath all of them. Reported early users span a wide range of applications: Agile Robots and Skild AI using it for humanoid and manipulation research, automakers like Li Auto reportedly applying it to autonomous-vehicle development, and vision-AI companies applying the same underlying model to industrial monitoring and smart-space applications — a signal that NVIDIA sees Cosmos 3 as infrastructure for physical AI broadly, not a single-purpose robotics tool.

The stakes: robots are leaving the demo stage

None of this would matter as much if humanoid robots were still confined to trade-show stages. They aren't. Figure AI says its BotQ facility has pushed production from one Figure 03 unit a day in January to roughly one per hour by late April — a 24-fold increase in under four months — with a stated annual target near 12,000 units. Boston Dynamics unveiled a fully electric, production-ready version of Atlas at CES 2026 and has already committed its 2026 output to Hyundai's robotics manufacturing arm and to Google DeepMind, which is supplying the Gemini Robotics models that will serve as part of Atlas's reasoning layer. Unitree, meanwhile, is chasing volume rather than sophistication with its sub-$6,000 R1 humanoid, with its CEO suggesting the company alone could ship 10,000 to 20,000 units in 2026 — up sharply from roughly 5,500 the year before.

That deployment wave is precisely what makes the sim-to-real problem urgent rather than academic. A research-lab robot that occasionally misjudges a grip is a nuisance. A commercially deployed humanoid working alongside people in an Amazon warehouse or a Toyota manufacturing plant that occasionally misjudges a grip is a safety incident. NVIDIA's other June 2026 announcement, a full-stack safety architecture called Halos for Robotics, is best understood as the other half of the same story: where Cosmos 3 is about training a robot to behave well before it ever acts in the real world, Halos is about catching and constraining the moments when it doesn't. Agility Robotics has signed on as the first commercial adopter, integrating Halos into the safety systems of its Digit humanoid, which already works in logistics and warehouse settings for customers including Amazon, GXO, and Toyota Motor Manufacturing Canada.
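The production figures quoted in this section are easy to sanity-check. The arithmetic below uses only the numbers given above, plus one labeled assumption: that "one per hour" means a line running around the clock, every day of the year, which none of the cited claims spell out.

```python
HOURS_PER_DAY = 24
DAYS_PER_YEAR = 365  # assumes continuous operation, which is not stated

# Figure AI: one unit per day in January, roughly one per hour by late April.
figure_before_per_day = 1
figure_after_per_day = 1 * HOURS_PER_DAY
ramp_multiple = figure_after_per_day / figure_before_per_day

# What one unit per hour adds up to over a year, versus the stated target.
annual_at_one_per_hour = figure_after_per_day * DAYS_PER_YEAR
annual_target = 12_000
required_per_hour = annual_target / (HOURS_PER_DAY * DAYS_PER_YEAR)

# Unitree: roughly 5,500 units the year before, 10,000 to 20,000 suggested for 2026.
unitree_prior = 5_500
growth_low = 10_000 / unitree_prior
growth_high = 20_000 / unitree_prior

print(f"Figure ramp: {ramp_multiple:.0f}x")
print(f"Annualized at one per hour: {annual_at_one_per_hour:,} units")
print(f"Rate needed for {annual_target:,}/year: {required_per_hour:.2f} per hour")
print(f"Unitree growth: {growth_low:.2f}x to {growth_high:.2f}x")
```

The 24-fold figure holds only under the round-the-clock assumption. Even then, one unit per hour annualizes to 8,760 units, so a target near 12,000 implies a rate of about 1.37 per hour, whether through a faster line or additional capacity. Unitree's range works out to roughly 1.8 to 3.6 times its prior-year volume.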

The big bang of physical AI is just around the corner, thanks to breakthroughs in multimodal reasoning, language, vision, and world models.

— Jensen Huang, Founder and CEO, NVIDIA, GTC Taipei, June 2026

What this doesn't solve

It's worth being honest about the limits here, because the rhetoric around world models has a tendency to outrun the engineering. Cosmos 3 does not eliminate the sim-to-real gap; it narrows it by making the synthetic side of that gap dramatically richer, more physically grounded, and cheaper to generate at scale. A world model trained on 400 million videos still cannot have seen every real-world edge case a deployed robot will eventually encounter — the spilled liquid on an unexpected surface, the human who moves unpredictably, the object whose weight distribution doesn't match its appearance. What Cosmos 3 changes is the odds: instead of a robotics team hand-building a few hundred simulated scenarios and hoping they generalize, they can now generate an almost unbounded number of physically plausible variations and use them to harden a policy before it ever meets a real forklift, a real spill, or a real person. That is a meaningfully different starting position than robotics has had before — not a solved problem, but a substantially better rehearsal space.
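The difference between a few hand-built scenarios and a large generated set can be shown with a deliberately crude stand-in. The sketch below uses plain parameter randomization, which is far simpler than what a generative world model does, and a toy friction-and-crush grip rule; the parameter ranges, the 60 N grip force, and all names (`Scenario`, `sample_scenarios`, `grip_succeeds`) are assumptions made up for illustration.

```python
import random
from dataclasses import dataclass

G = 9.81  # gravitational acceleration, m/s^2


@dataclass(frozen=True)
class Scenario:
    mass_kg: float        # object mass
    friction: float       # gripper-object friction coefficient
    crush_limit_n: float  # grip force above which the object is damaged


def sample_scenarios(n: int, seed: int):
    """Draw n randomized scenarios from assumed parameter ranges."""
    rng = random.Random(seed)
    return [
        Scenario(rng.uniform(0.2, 5.0), rng.uniform(0.2, 1.0), rng.uniform(40.0, 200.0))
        for _ in range(n)
    ]


def grip_succeeds(grip_force_n: float, sc: Scenario) -> bool:
    """Toy rule: friction must hold the weight, and the grip must not crush."""
    holds = sc.friction * grip_force_n >= sc.mass_kg * G
    intact = grip_force_n <= sc.crush_limit_n
    return holds and intact


def success_rate(grip_force_n: float, scenarios) -> float:
    return sum(grip_succeeds(grip_force_n, s) for s in scenarios) / len(scenarios)


# A policy tuned against three hand-built cases looks perfect...
hand_built = [Scenario(1.0, 0.6, 100.0), Scenario(2.0, 0.5, 150.0), Scenario(0.5, 0.8, 80.0)]
policy_grip_force = 60.0
hand_rate = success_rate(policy_grip_force, hand_built)

# ...until it is rehearsed against thousands of variations.
generated = sample_scenarios(10_000, seed=0)
generated_rate = success_rate(policy_grip_force, generated)

print(f"hand-built: {hand_rate:.0%}, generated: {generated_rate:.0%}")
```

The fixed grip force passes every hand-built case, yet the larger sample surfaces the heavy, slippery, and fragile objects it cannot handle. That is the rehearsal argument in miniature, and also its limit: the sampled set only covers the ranges someone thought to include.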

What happens next

Expect the next twelve months to be less about a single splashy demo and more about whether the Cosmos Coalition's members can turn an open world model into measurably better robot behavior in the field — fewer dropped objects, fewer safety stops, faster generalization to new tasks without months of re-training. Watch for Cosmos 3 Edge's release, which will be the real test of whether this reasoning-before-acting architecture can run on the actual compute budget of a commercial robot rather than a data center. Watch, too, for how Halos for Robotics and Cosmos 3 end up talking to each other in practice, since a safety layer is only as good as the behavior it's constraining, and a well-rehearsed robot is only trustworthy if something is still watching for the rehearsal's blind spots. The framing NVIDIA has settled on — that physical AI needs to think before it acts — is a genuinely useful way to describe what's changed. The harder, more interesting question is how much of the real world a model can ever truly rehearse before it has to just go outside and try.

Sources

NVIDIA Newsroom. "NVIDIA Launches Cosmos 3, the Open Frontier Foundation Model for Physical AI." June 2026. nvidianews.nvidia.com
NVIDIA Blog. "How Cosmos 3 Helps Physical AI Think Before It Acts." blogs.nvidia.com
NVIDIA Technical Blog. "Develop Physical AI Reasoning, World, and Action Models with NVIDIA Cosmos 3." developer.nvidia.com
NVIDIA Research. "Cosmos 3: Omnimodal World Models for Physical AI" (technical report). research.nvidia.com/labs/cosmos-lab
Hugging Face Blog. "Welcome NVIDIA Cosmos 3: The First Open Omni-model for Physical AI Reasoning and Action." huggingface.co
MarkTechPost. "NVIDIA Releases Cosmos 3: A Two-Tower Mixture-of-Transformers Foundation Model Unifying Physical Reasoning, World Generation, and Action Generation." June 2026. marktechpost.com
the-decoder. "Nvidia bets big on physical AI at GTC Taipei with a new world model, driving brain, and open humanoid robot." the-decoder.com
NVIDIA Newsroom. "NVIDIA and Global Robotics Leaders Take Physical AI to the Real World." nvidianews.nvidia.com
NVIDIA Newsroom. "NVIDIA Announces Halos for Robotics, the Industry's First Full-Stack Safety System for Physical AI." nvidianews.nvidia.com
The Robot Report. "NVIDIA releases Halos, a full-stack safety system for robotics." therobotreport.com
Figure AI. "Ramping Figure 03 Production." figure.ai/news
Interesting Engineering. "Figure claims new BotQ facility can make one humanoid robot per hour." interestingengineering.com
The Robot Report. "Boston Dynamics, Google reunite on next-gen Atlas humanoid." therobotreport.com
eWeek. "China's Unitree Aims to Ship 20,000 Humanoid Robots in 2026." eweek.com
KraneShares. "Humanoid Robotics in 2026: The Race From Pilot to Platform." kraneshares.com
