1,479 tokens per second. That is the speed at which Google DeepMind's Gemini Diffusion generates text, measured by independent testers in May 2025 (with bursts above 1,600). The fastest autoregressive model at equivalent quality produces roughly 300. The difference is not incremental. It is architectural. For eight years, every large language model has generated text the same way: one token after another, left to right, each conditioned on everything that came before. Gemini Diffusion doesn't. It generates entire blocks simultaneously, refining noise into language the way image generators refine static into photographs. The constraint that defined the field since 2017 just broke.
The Number
On 20 May 2025, Google DeepMind released Gemini Diffusion as an experimental research prototype. The model is not yet a product. It doesn't power Google Search or Gmail or any consumer-facing application. It sits in Google AI Studio as a preview, available for testing but not for production deployment. None of that matters. What matters is that it works.
The benchmark comparisons are close. On HumanEval, the standard code-generation benchmark, Gemini Diffusion scores 89.6%, compared to 90.2% for Gemini 2.0 Flash-Lite, the autoregressive model it's benchmarked against. On MBPP, another code benchmark, it scores 76.0% versus 75.8%. These are not dramatic differences in accuracy. The speed is dramatic: roughly five times faster at comparable quality, generating 128 or more tokens in a single denoising step rather than producing them one at a time.
To understand why 1,479 tokens per second changes the economics and architecture of AI, you first need to understand the constraint it removes.
What It Took to Find It
Every transformer-based language model since the original 2017 "Attention Is All You Need" paper has used autoregressive generation. The model predicts the next token in a sequence, appends it, then predicts the token after that, and so on. Each prediction depends on every token that came before it. This is why language models feel like they're "typing": they are. Token by token, left to right, with no ability to look ahead or revise what they've already committed to.
The constraint is fundamental, not cosmetic. An autoregressive model generating a 500-token response makes 500 sequential forward passes through the network. Each pass depends on the previous one. You can't parallelise the generation itself, only the computation within each pass. This creates a hard ceiling on generation speed that no amount of hardware can fully overcome.
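A minimal sketch of that sequential dependency, assuming a toy stand-in for the network. The names `generate_autoregressive` and `toy_next_token` are hypothetical and exist only for illustration; no real model's API is implied.

```python
from typing import Callable, List, Tuple


def generate_autoregressive(
    prompt: List[int],
    next_token: Callable[[List[int]], int],
    max_new_tokens: int,
) -> Tuple[List[int], int]:
    """Generate tokens one at a time; return (sequence, forward-pass count)."""
    tokens = list(prompt)
    passes = 0
    for _ in range(max_new_tokens):
        # Each call sees every token produced so far, so step t+1
        # cannot begin until step t has finished.
        tokens.append(next_token(tokens))
        passes += 1
    return tokens, passes


def toy_next_token(context: List[int]) -> int:
    # Hypothetical stand-in for a forward pass: a deterministic
    # function of the full context, not a trained model.
    return (sum(context) + len(context)) % 100


sequence, passes = generate_autoregressive([1, 2, 3], toy_next_token, 500)
print(passes)  # 500
```

The loop body is the part that cannot be parallelised: the input to each call is the output of the previous one.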
Autoregressive vs. diffusion generation: An autoregressive model writes a sentence the way a typewriter does, one character at a time, left to right, with no going back. A diffusion model works more like a photograph developing in a chemical bath: the entire image emerges at once, blurry at first, and sharpens over several refinement steps. Gemini Diffusion applies this same principle to text, starting with a block of random noise tokens and iteratively denoising them into coherent language.
Diffusion models have dominated image generation since 2020. Stable Diffusion, DALL-E, Midjourney: all of them work by starting with random noise and progressively removing it until a coherent image appears. The core idea, learning to reverse a corruption process, was well understood. Applying it to discrete tokens rather than continuous pixel values was the hard part.
Text is not continuous. Pixels occupy a smooth, infinitely divisible colour space where "slightly more red" is a meaningful operation. Tokens are discrete: "cat" and "car" differ by one character but occupy completely separate positions in vocabulary space. You can't add Gaussian noise to a token the way you add it to a pixel. The forward diffusion process for text needs a different corruption mechanism.
The breakthrough came from masking. Instead of adding noise, the model randomly replaces tokens with mask tokens, progressively obscuring the text until it's entirely masked. The reverse process, generation, starts from a fully masked sequence and iteratively predicts which real tokens should replace the masks. This is the approach used by LLaDA (Large Language Diffusion with mAsking), a research model that demonstrated the viability of the concept, and it's the foundation Gemini Diffusion builds on.
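The corruption half of that process can be sketched in a few lines. This is an illustration in the spirit of masked diffusion, assuming each token is masked independently with probability `t`. The `MASK` symbol and the `mask_tokens` helper are hypothetical names, and the sketch says nothing about how Gemini Diffusion itself is trained.

```python
import random
from typing import List

MASK = "[MASK]"  # hypothetical mask symbol


def mask_tokens(tokens: List[str], t: float, rng: random.Random) -> List[str]:
    """Forward (corruption) process: mask each token independently with probability t."""
    if not 0.0 <= t <= 1.0:
        raise ValueError("t must be in [0, 1]")
    return [MASK if rng.random() < t else tok for tok in tokens]


rng = random.Random(0)
sentence = "the cat sat on the mat".split()
for t in (0.0, 0.5, 1.0):
    # t = 0.0 leaves the text intact; t = 1.0 masks everything.
    print(t, mask_tokens(sentence, t, rng))
```

Training then amounts to showing the model sequences corrupted at varying levels of `t` and asking it to recover the originals.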
The critical advantage: each denoising step can unmask multiple tokens simultaneously. The model considers the entire sequence bidirectionally, attending to both left and right context, and predicts a block of tokens at once. Where an autoregressive model makes 500 sequential passes for a 500-token response, a diffusion model might make 4 to 8 refinement passes, each one filling in or correcting 128 or more tokens simultaneously.
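The reverse process can be sketched the same way. The code below is a toy under stated assumptions: `predict` stands in for one bidirectional pass over the whole sequence and returns a guess plus a confidence for every masked position, and the sampler commits the most confident guesses at each step. All names are hypothetical. Real systems may also revise tokens they have already committed, which this sketch leaves out.

```python
import math
import random
from typing import Callable, Dict, List, Optional, Tuple

MASK = None  # masked positions are represented as None
Prediction = Tuple[int, float]  # (token id, confidence)


def generate_by_unmasking(
    length: int,
    predict: Callable[[List[Optional[int]]], Dict[int, Prediction]],
    steps: int,
) -> Tuple[List[Optional[int]], int]:
    """Start fully masked; each pass commits a block of tokens. Returns (sequence, passes)."""
    seq: List[Optional[int]] = [MASK] * length
    per_step = math.ceil(length / steps)
    passes = 0
    while any(tok is MASK for tok in seq):
        preds = predict(seq)  # one pass over the entire sequence
        passes += 1
        if not preds:
            raise RuntimeError("predictor returned nothing for masked positions")
        # Commit the most confident predictions; the rest stay masked.
        best = sorted(preds.items(), key=lambda kv: kv[1][1], reverse=True)[:per_step]
        for pos, (tok, _conf) in best:
            seq[pos] = tok
    return seq, passes


def make_toy_predictor(target: List[int], rng: random.Random):
    # Hypothetical stand-in for a trained denoiser: it "knows" the answer
    # and attaches a random confidence to each masked position.
    def predict(seq: List[Optional[int]]) -> Dict[int, Prediction]:
        return {i: (target[i], rng.random()) for i, tok in enumerate(seq) if tok is MASK}
    return predict


rng = random.Random(0)
target = [rng.randrange(1000) for _ in range(512)]
tokens, passes = generate_by_unmasking(len(target), make_toy_predictor(target, rng), steps=4)
print(passes)  # 4 passes, 128 tokens committed per pass
```

With 512 positions and 4 steps, each pass fills 128 tokens, where a token-by-token loop would need 512 passes for the same length.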
Why It Was Missed
It wasn't missed, exactly. Researchers have been exploring discrete diffusion for text since at least 2022. The Masked Diffusion Language Model (MDLM) framework showed that text diffusion could work in principle. LLaDA, published in early 2025, demonstrated that a diffusion language model with 8 billion parameters could match the performance of autoregressive models of the same size while using far fewer training tokens (2.3 trillion versus 15 trillion for LLaMA 3). The idea was in the air.
What wasn't in the air was the engineering required to make it fast at scale. Academic research prototypes proved the concept but couldn't match production autoregressive models on real-world latency. The gap between "this approach works in principle" and "this approach generates 1,479 tokens per second" is the gap Google's infrastructure fills. Google has been building custom TPU hardware for roughly a decade. The vertical integration from silicon to model architecture to serving infrastructure is what turns a research idea into a number that changes the conversation.
There's a second reason diffusion language models took longer to arrive than diffusion image models. Images tolerate imperfection gracefully: a slightly blurry region or a subtly wrong colour is barely noticeable. Text does not. A single wrong token can change the meaning of a sentence entirely. The refinement process has to be more precise, the error correction more aggressive, and the number of denoising steps carefully calibrated to balance speed against coherence. Gemini Diffusion's bidirectional attention is part of the solution: because the model sees the entire sequence at every step, it can catch and correct inconsistencies that an autoregressive model, locked into its left-to-right commitment, cannot.
“Gemini Diffusion can generate text and code with a quality similar to Gemini 2.0 Flash Lite and at significantly faster speeds.”
Google DeepMind, Gemini Diffusion announcement page, May 2025
What Changes
Speed changes cost. At five times the tokens per second, you can serve five times the requests with the same hardware. Or you can serve the same number of requests with one-fifth the GPUs. In an industry where compute is the binding constraint, where OpenAI reportedly spends hundreds of millions per month on inference infrastructure, a 5x speed improvement at equivalent quality is not a benchmark curiosity. It's a structural cost reduction.
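A back-of-the-envelope version of that claim, assuming the per-stream speeds quoted above translate directly into per-device throughput. That is a simplification, since real serving stacks batch requests. The demand figure is hypothetical and chosen only to make the arithmetic concrete.

```python
import math


def devices_needed(demand_tokens_per_s: float, tokens_per_s_per_device: float) -> int:
    """Devices required to sustain a given aggregate token demand."""
    return math.ceil(demand_tokens_per_s / tokens_per_s_per_device)


AR_SPEED = 300.0          # tokens/s, autoregressive figure from the article
DIFFUSION_SPEED = 1479.0  # tokens/s, Gemini Diffusion figure from the article
DEMAND = 1_000_000.0      # tokens/s, hypothetical aggregate demand

ar_devices = devices_needed(DEMAND, AR_SPEED)
diffusion_devices = devices_needed(DEMAND, DIFFUSION_SPEED)

print(ar_devices, diffusion_devices)           # 3334 677
print(round(DIFFUSION_SPEED / AR_SPEED, 2))    # 4.93
```

The ratio comes out at about 4.93, which is where the "five times" shorthand comes from.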
Speed also changes what's possible. Agentic AI systems, where models plan and execute multi-step tasks autonomously, are bottlenecked by generation latency. An agent that needs to generate twenty intermediate reasoning steps before acting makes twenty sequential generation calls. At 300 tokens per second, that's noticeable. At 1,479, it approaches real-time. The gap between "AI assistant" and "AI colleague" is partly a latency gap, and diffusion narrows it.
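To put rough numbers on the agent case: the sketch below assumes 200 generated tokens per reasoning step (a hypothetical figure) and counts generation time only, ignoring prompt processing, tool calls, and network overhead.

```python
def chain_latency_s(steps: int, tokens_per_step: int, tokens_per_s: float) -> float:
    """Total generation time for a chain of sequential calls, in seconds."""
    return steps * tokens_per_step / tokens_per_s


STEPS = 20             # from the article's example
TOKENS_PER_STEP = 200  # hypothetical

slow = chain_latency_s(STEPS, TOKENS_PER_STEP, 300.0)
fast = chain_latency_s(STEPS, TOKENS_PER_STEP, 1479.0)

print(f"{slow:.1f} s at 300 tok/s")    # 13.3 s
print(f"{fast:.1f} s at 1,479 tok/s")  # 2.7 s
```

Under those assumptions the same twenty-step plan drops from about thirteen seconds of waiting to under three.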
The deeper change is architectural. For eight years, the assumption has been that language models generate sequentially. Every optimisation, every inference framework, every hardware design has been built around that assumption: KV caches that grow linearly with sequence length, speculative decoding that tries to guess the next few tokens to batch them, quantisation schemes designed to reduce the cost of each sequential forward pass. Diffusion generation breaks all of those assumptions. The cache structure is different. The parallelism pattern is different. The relationship between model size and generation speed is different.
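The linear growth of the KV cache is easy to see from the standard size estimate: two tensors (keys and values) per layer, each holding one vector per KV head per token. The model dimensions below are hypothetical and do not describe any Gemini model.

```python
def kv_cache_bytes(
    seq_len: int,
    layers: int = 32,         # hypothetical
    kv_heads: int = 8,        # hypothetical
    head_dim: int = 128,      # hypothetical
    bytes_per_value: int = 2  # 16-bit floats
) -> int:
    """Approximate KV-cache size for one sequence in an autoregressive transformer."""
    # Factor of 2 covers the separate key and value tensors.
    return 2 * layers * kv_heads * head_dim * bytes_per_value * seq_len


per_token = kv_cache_bytes(1)
print(per_token)                      # 131072 bytes, i.e. 128 KiB per token
print(kv_cache_bytes(8192) / 2**30)   # 1.0 GiB for an 8,192-token sequence
```

Every generated token adds the same fixed amount, which is why autoregressive serving stacks are organised around managing this cache.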
Google isn't the only player moving in this direction. LLaDA and its successors (LLaDA-o, LaViDa) are open research. The discrete diffusion paradigm is publishable, reproducible, and architecturally distinct enough that it can't be patented away. Within two years, every major lab will have a diffusion language model in its lineup. The question is whether diffusion replaces autoregressive generation entirely or coexists alongside it, the way transformers replaced RNNs for most tasks but left recurrence alive in specialised niches.
The number says replacement. 1,479 tokens per second, at equivalent quality, with bidirectional context and built-in error correction. The autoregressive paradigm defined a generation of AI. The generation after it just started.
- Google DeepMind, “Gemini Diffusion,” model announcement page, May 2025. deepmind.google
- DataCamp, “Gemini Diffusion: A Guide With 8 Practical Examples,” 2025. datacamp.com
- Simon Willison, “Gemini Diffusion,” May 2025. simonwillison.net
- Hugging Face Blog, “Diffusion Language Models: The New Paradigm,” 2025. huggingface.co
- Chatterjee, S., “Diffusion Models for Language: From Early Promise to a Bold New Frontier with LLaDA,” Medium, 2025. medium.com
- CTOL Digital Solutions, “Google DeepMind Unveils Gemini Diffusion: A Paradigm Shift in AI Text Generation,” 2025. ctol.digital
- CometAPI, “What is Gemini Diffusion? All You Need to Know,” 2025. cometapi.com
- LLaDA Project, “Large Language Diffusion Models,” demo page. ml-gsai.github.io