Field Notes · The Economics of Intelligence

The Cheaper Mind

For years the story of AI was bigger, hungrier, more power-thirsty. A quieter revolution is now running the other way — teaching the machine to think for a twentieth of the cost. The question is whether that saves us, or simply unleashes more of it.

June 19, 2026 · Lisa Pedrosa · 10 min read · Energy
[Figure: cost per token, falling to 1/20th of the prior generation]

Every dominant story about artificial intelligence for the past five years has been a story about more. More parameters, more data, more gigawatts, more data centers swallowing the output of whole power plants. The headline number that mattered was always going up. So it is worth pausing on a number, announced quietly on the first of June, that goes sharply down: a twentieth. One-twentieth of the compute, for the same work.

The number belongs to MiniMax M3, an open-weight model released on June 1, 2026, and the figure refers to its compute cost per token when handling very long inputs. Where the previous generation of models spent a certain amount of computation to process each token of a million-token document, M3 spends roughly five percent of that. The same work, for a fraction of the computation and, in principle, of the power. It is the clearest signal yet of a counter-revolution running underneath the headlines: the discovery that the brute-force era of AI left enormous efficiency lying on the table, and that picking it up may matter more than the next size increase.

Where the waste was hiding

To see what M3 actually fixed, you have to understand the strange arithmetic of attention, the mechanism at the heart of every modern language model. Attention is what lets a model relate each word to every other word — it is the reason these systems can follow a long argument or a long codebase. But it is expensive in a particular, brutal way: in the classic design, the cost of attention grows with the square of the input length. Double the document and you roughly quadruple the work. Feed in a million tokens and the machine drowns, most of its effort spent comparing words that have nothing to do with each other.
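The squaring is easy to check by hand. The sketch below counts one comparison per ordered pair of tokens and ignores constants such as the number of layers and attention heads; the function name is illustrative, not taken from any library.

```python
def attention_pairs(n_tokens: int) -> int:
    """Comparisons in classic (dense) attention: every token against every token."""
    return n_tokens * n_tokens

# Doubling the input quadruples the count; a million tokens means a trillion pairs.
for n in (1_000, 2_000, 1_000_000):
    print(f"{n:>9,} tokens -> {attention_pairs(n):,} comparisons")
```

Real systems trim this with causal masking and other tricks, but the quadratic shape is what matters for the argument.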

MiniMax's answer, called MiniMax Sparse Attention, refuses to do all that pointless comparing. A lightweight "index" branch first skims the incoming text and decides which blocks of earlier words are actually relevant to the word being processed. Then full attention runs only on those blocks. The vast majority of word-pairs are never computed, because the index has judged that they do not matter. The result is the twentyfold cut in per-token compute and, just as importantly, more than nine times faster processing of the initial prompt and more than fifteen times faster generation of each new word at long context.
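To make the select-then-attend idea concrete, here is a minimal sketch in plain Python. It is not MiniMax's implementation: the scoring rule (the query's dot product with each block's average key), the function names, and the toy data are all assumptions chosen for illustration. It shows only the general shape, in which a cheap pass picks blocks and ordinary softmax attention then runs on those blocks alone.

```python
import math

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def select_blocks(query, keys, block_size, top_k):
    """Hypothetical index branch: score each block by query . mean(key), keep top_k."""
    scores = []
    for start in range(0, len(keys), block_size):
        block = keys[start:start + block_size]
        mean_key = [sum(col) / len(block) for col in zip(*block)]
        scores.append((dot(query, mean_key), start))
    chosen = sorted(scores, reverse=True)[:top_k]
    return sorted(start for _, start in chosen)

def sparse_attention(query, keys, values, block_size, top_k):
    """Softmax attention restricted to positions inside the selected blocks."""
    starts = select_blocks(query, keys, block_size, top_k)
    positions = [i for s in starts for i in range(s, min(s + block_size, len(keys)))]
    scale = 1.0 / math.sqrt(len(query))
    logits = [dot(query, keys[i]) * scale for i in positions]
    peak = max(logits)  # subtract the max for numerical stability
    weights = [math.exp(l - peak) for l in logits]
    total = sum(weights)
    dim = len(values[0])
    output = [sum(w * values[i][d] for w, i in zip(weights, positions)) / total
              for d in range(dim)]
    return output, positions

# Toy input: three blocks of four keys; only the middle block aligns with the query.
keys = [[1.0, 0.0]] * 4 + [[0.0, 1.0]] * 4 + [[-1.0, 0.0]] * 4
values = [[float(i)] for i in range(12)]
out, used = sparse_attention([0.0, 1.0], keys, values, block_size=4, top_k=1)
print("positions attended:", used)  # 4 of the 12 positions
print("output:", out)
```

In this toy case the selector keeps the middle block, so full attention touches four positions instead of twelve. The saving grows with the ratio of total blocks to selected blocks, and so does the risk of skipping a block that mattered.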

1/20th: the per-token compute of the prior generation at 1M-token context
15×: faster decoding at long context; ~9× faster prefill
~100/s: tokens generated per second, roughly 3× a leading closed model
59.0%: SWE-Bench Pro score, beating GPT-5.5 and Gemini 3.1 Pro

What makes this more than a single company's clever trick is that the efficiency wave is breaking on several shores at once. At ICLR 2026, Google researchers presented TurboQuant, an algorithm aimed at a different bottleneck: the so-called KV cache, the model's short-term memory of the conversation so far, which balloons in size as context grows. TurboQuant compresses that memory using a two-step mathematical maneuver (a vector rotation followed by a classic dimensionality-reduction trick) so that the same conversation fits in far less hardware memory. Different target, same project: wringing the slack out of a technology that was built, in its first frantic years, for capability at any cost.
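A rough sense of why the KV cache is worth compressing comes from its size formula: one key vector and one value vector per token, per attention head, per layer. The sketch below applies that formula to a hypothetical model shape. The layer count, head count, head dimension, and the 4-bit target are illustrative assumptions, not figures from TurboQuant or any named model.

```python
def kv_cache_bytes(tokens, layers, kv_heads, head_dim, bits_per_value):
    """Memory for cached keys and values: 2 tensors per layer, one vector per token per head."""
    values_stored = 2 * layers * kv_heads * head_dim * tokens
    return values_stored * bits_per_value // 8

# Hypothetical model shape, chosen only for illustration.
shape = dict(layers=60, kv_heads=8, head_dim=128)
full = kv_cache_bytes(1_000_000, bits_per_value=16, **shape)
packed = kv_cache_bytes(1_000_000, bits_per_value=4, **shape)

print(f"16-bit cache: {full / 2**30:.1f} GiB")
print(f" 4-bit cache: {packed / 2**30:.1f} GiB")
```

Because the cache scales linearly with both context length and bits per stored value, cutting the bits by a factor of four cuts the memory by the same factor, whatever the model's actual dimensions turn out to be.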

The first era of AI asked how smart a machine could be. The second is asking how cheaply it can be that smart — and that is the question that decides who gets to use it.
— On the efficiency turn

Why a twentieth is a political number

It is tempting to file all this under engineering housekeeping, but the stakes are larger and stranger than that. Two facts make efficiency the most consequential frontier in AI right now. The first is energy. Data centers, driven largely by AI, are on track to consume on the order of 1,050 terawatt-hours of electricity in 2026 — enough that, if data centers were a country, they would rank among the largest electricity consumers on Earth, somewhere between Japan and Russia. And the dominant share of that draw is not the dramatic, one-time cost of training a model. It is inference: the everyday business of actually answering queries, which accounts for an estimated 80 to 90 percent of an AI system's lifetime energy use. Inference is exactly what sparse attention and KV-cache compression make cheaper. Cut inference cost by an order of magnitude and you have, in principle, bent the most worrying curve in the entire industry.

The second fact is access. When the cost of running a frontier-grade model drops twentyfold, the model stops being the exclusive property of whoever owns the biggest data center. That M3 is open-weight — its parameters freely downloadable — compounds the effect: a capable, million-token, multimodal system that a university lab, a hospital, or a startup in Nairobi can run on hardware they can actually afford. Efficiency is quietly democratizing, redistributing a technology that scaling had concentrated in a handful of hands.

Roughly 80 to 90 percent of all the energy an AI model will ever use is spent not on building it, but on running it, one answer at a time. Efficiency that targets inference is therefore not a footnote — it is the whole game.
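The arithmetic behind that claim is worth a moment. If only the inference share of a system's lifetime energy gets cheaper, the training share sets a floor on the savings. The sketch below assumes, optimistically, that a twentyfold cut in compute becomes a twentyfold cut in inference energy and that the number of queries stays fixed; the function name is illustrative.

```python
def lifetime_energy_after(inference_share, inference_speedup):
    """Fraction of original lifetime energy left when only inference gets cheaper."""
    training_share = 1.0 - inference_share
    return training_share + inference_share / inference_speedup

for share in (0.80, 0.90):
    remaining = lifetime_energy_after(share, inference_speedup=20)
    print(f"inference share {share:.0%}: {remaining:.1%} of original energy remains")
```

Under those assumptions, a system whose energy is 80 percent inference would keep about 24 percent of its original footprint, and one at 90 percent would keep about 14.5 percent. The larger the inference share, the more an inference-side gain is worth.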

The paradox that haunts every efficiency gain

And here the story bends back on itself, because there is a ghost in this machine, and his name is William Stanley Jevons. In 1865 the English economist noticed something that should have been impossible: as steam engines became more efficient at burning coal, Britain burned more coal, not less. Cheaper energy per unit of work meant more work got done, and the total appetite grew. Economists have called it the Jevons paradox ever since, and it stalks the efficiency story in AI like a shadow that cannot be outrun.

The logic is uncomfortably clean. If each AI query becomes twenty times cheaper, the rational response of the market is not to bank the savings but to run far more queries, possibly more than twenty times as many: to put AI into every search box, every document, every appliance, every background process that was previously too expensive to bother automating. Researchers who study AI's environmental footprint warn that per-unit efficiency gains are routinely "wiped out by volume growth." The cheaper mind does not necessarily sip less power. It may simply find itself invited into ten thousand new rooms.

[Figure: The Jevons trap. Cost per query ↓ 20×; total queries run ↑ by an unknown amount; net energy: unsettled.]
Efficiency reliably lowers the cost of a single AI query. Whether it lowers total energy use depends on a behavior no chip can control: how much more we choose to ask.
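The break-even point in that trade is simple arithmetic. Total energy scales with the number of queries divided by the efficiency gain, so the outcome hinges on whether usage grows faster than cost falls. The growth figures below are hypothetical scenarios, not forecasts, and the function name is illustrative.

```python
def net_energy_ratio(efficiency_gain, query_growth):
    """Total energy relative to today: more queries, each costing 1/efficiency_gain."""
    return query_growth / efficiency_gain

for growth in (5, 20, 50):
    ratio = net_energy_ratio(efficiency_gain=20, query_growth=growth)
    print(f"{growth}x queries -> {ratio:.2f}x energy")
```

With a twentyfold efficiency gain, fivefold growth in queries leaves energy use at a quarter of today's, twentyfold growth cancels the gain exactly, and fiftyfold growth more than doubles the total.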

This is not an argument against efficiency — it is an argument that efficiency alone settles nothing. Whether the cheaper mind eases AI's planetary footprint or enlarges it is not a question that sparse attention can answer. It is a question of policy, pricing, and restraint: whether the savings are allowed to translate into less energy used, or simply into more intelligence consumed. The technology hands us a choice it cannot make for us.

Make a thing cheap enough and you do not get less of it. You get it everywhere.
— The lesson of the steam engine, recast for silicon

The honest middle

It would be a mistake to oversell M3 and its cousins. MiniMax has not disclosed the model's full parameter count, the headline numbers deserve scrutiny until they are independently reproduced, and a benchmark score is not the same thing as a model that behaves well in the wild. Sparse attention also makes a bet, namely that the blocks it skips really were irrelevant, and on some tasks, where a stray detail buried deep in a long document turns out to matter, that bet can cost accuracy. Efficiency in AI almost always involves a trade, and the trade is rarely free. The art is in making it small enough to be worth it.

But the direction is unmistakable and, I think, healthy. For half a decade the field's imagination was captured by a single axis — scale — as though intelligence were a quantity you could only buy by the megawatt. The efficiency turn reopens the other axes: cleverer architectures, leaner memory, smarter use of the computation you already have. It is the difference between getting stronger by eating more and getting stronger by training better, and a discipline that only knows the first trick is a discipline in trouble.

The next year will tell us which force wins. If efficiency gains are captured as genuine reductions — fewer gigawatts per unit of useful work, frontier capability on modest hardware — then 2026 may be remembered as the year AI stopped being a technology that needed a planet and started becoming one that could fit inside a budget. If instead the Jevons ghost has his way, the same gains will vanish into an explosion of new uses, and the data-center map will keep darkening the horizon. Either way, the most important number in artificial intelligence is no longer how big the model is. It is how little it costs to think — and what, exactly, we decide to do with all that cheap thinking now that we have it.

Sources

  1. "MiniMax M3: Open-Weight Frontier Model with 1M Context," DataNorth AI. datanorth.ai/news/minimax-launches-m3
  2. "MiniMax Sparse Attention," paper page, Hugging Face (arXiv:2606.13392). huggingface.co/papers/2606.13392
  3. "MiniMax M3 Takes Open-Weight AI Lead: Sparse Attention Architecture Now Verified," Tech Times. techtimes.com
  4. "MiniMax teases upcoming M3 model with new sparse attention mechanism," VentureBeat. venturebeat.com
  5. "Serving MiniMax-M3 for efficient inference," Together AI. together.ai
  6. "MiniMax-M1: Scaling Test-Time Compute Efficiently with Lightning Attention," arXiv:2506.13585. arxiv.org/pdf/2506.13585
  7. TurboQuant / KV-cache compression, ICLR 2026 — coverage via "6 AI breakthroughs that will define 2026," InfoWorld. infoworld.com
  8. "Data Centers and AI Energy Consumption," Global Electricity. globalelectricity.org
  9. "Energy use of AI inference, efficiency pathways, and test-time scaling," ScienceDirect. sciencedirect.com
  10. "How Much Energy Will It Take To Power AI?," Contrary Research. research.contrary.com
  11. "From Efficiency Gains to Rebound Effects: The Problem of Jevons' Paradox in AI," arXiv:2501.16548 / ACM FAccT 2025. arxiv.org/html/2501.16548v1
  12. "Global energy demands within the AI regulatory landscape," Brookings. brookings.edu
  13. "Latest AI Model Releases — June 2026," LLM-Stats. llm-stats.com/ai-news