How does MiniMax M3 score on SWE-Bench Pro compared to GPT-5.5 and Gemini?

M3 scored 59.0% on SWE-Bench Pro, surpassing both OpenAI GPT-5.5 (approximately 54%) and Google Gemini 3.1 Pro (approximately 53%). However, it still trails the top closed model, Claude Opus 4.8, which is reported at around 69.2% on the same benchmark.

What is a Mixture-of-Experts model and how does M3 use it?

A Mixture-of-Experts model routes each token of input to only a small subset of specialized sub-networks, called experts, rather than activating the entire model. M3 has 229.9 billion total parameters but activates only about 9.8 billion per token across 256 fine-grained experts, dramatically reducing compute costs while preserving broad capability.

How does M3 handle a one-million-token context window efficiently?

M3 uses a sparse-attention mechanism called MSA that cuts per-token compute at full one-million-token context to roughly one-twentieth of the previous generation. This delivers a 9.7× faster prefill and a 15.6× faster decode, allowing the model to process very long documents without prohibitive hardware costs.

What are the risks of open-weight AI models like MiniMax M3?

Once an open-weight model is downloaded by hundreds of thousands of users, it cannot be recalled or patched remotely. Safety guardrails that closed API providers enforce can be stripped by anyone willing to fine-tune the model, distributing powerful capability to actors who may not follow any rules or safety guidelines.

MiniMax M3: The Open Model That Caught the Giants

What is MiniMax M3 and what makes it different from other AI models?

MiniMax M3 is an open-weight AI model released on June 1, 2026, by Chinese lab MiniMax. It is the first open-weight model to combine frontier-level coding ability, a one-million-token context window, and native multimodality in a single architecture, meaning anyone can download, run, and fine-tune it without paying a fee or asking permission.

For most of the modern AI era, there has been an unspoken hierarchy. The very best models—the ones that could write production-grade code, reason through hard problems, and hold an entire codebase in mind at once—lived inside a handful of American companies, accessible only through a metered interface, their weights a closely guarded secret. Everyone else got the leftovers. On June 1, 2026, a Chinese lab called MiniMax released a model that quietly upended that arrangement.

It is named M3, and its claim is audacious: the first open-weight model to combine frontier-level coding ability, a one-million-token context window, and native multimodality—image and video understanding—all in a single architecture. "Open-weight" is the crucial word. Anyone can download M3, run it on their own hardware, inspect it, fine-tune it, and build on it without asking permission or paying a toll. And on at least one closely watched benchmark, it beat the flagship closed models from the largest labs in the world.

The Numbers That Made People Look Twice

On SWE-Bench Pro—a demanding test of real-world software engineering, where a model must fix actual bugs in actual code repositories—M3 scored 59.0%, surpassing both OpenAI's GPT-5.5 and Google's Gemini 3.1 Pro. For an openly released model to top the proprietary giants on a coding benchmark, even a single one, is the kind of result that reorders expectations.

It is worth being honest about the ceiling, too. M3 does not beat everyone everywhere. On that same SWE-Bench Pro, the very top closed models still lead—Claude Opus 4.8, for instance, is reported around 69.2%, comfortably ahead. On Terminal-Bench and OSWorld-Verified, M3 trails the frontier leaders as well. The story is not that the open frontier has won. It is that the gap, once a chasm, has shrunk to a stride.

59.0%: SWE-Bench Pro score, ahead of GPT-5.5 and Gemini 3.1 Pro
1M: token context window
9.8B: active parameters per token (of 229.9B total)
~20×: less compute at 1M context vs. the prior generation

The Trick Is in the Architecture

The most interesting part of M3 is not its scores but how it achieves them so cheaply. M3 is a Mixture-of-Experts model. Its full size is enormous—229.9 billion parameters—but for any given token of text it processes, it activates only about 9.8 billion of them, routed across 256 fine-grained "experts," sub-networks that specialize. Think of it as a vast staff of consultants where, for each question, only the two or three most relevant ever get called into the room. You keep the breadth of the whole organization but pay for only a sliver of it on each query.
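The consultant analogy can be made concrete with a toy router. The sketch below is a minimal illustration of top-k expert routing in general, not MiniMax's implementation: the expert count of 256 comes from the article, while `TOP_K = 8`, the scalar "experts," and the random router scores are assumptions chosen only so the example runs.

```python
import math
import random

NUM_EXPERTS = 256  # expert count reported for M3
TOP_K = 8          # assumption: the article does not say how many experts fire per token


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=TOP_K):
    """Pick the top_k highest-scoring experts and normalize their weights."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    chosen = ranked[:top_k]
    weights = softmax([router_logits[i] for i in chosen])
    return list(zip(chosen, weights))


def moe_layer(token, router_logits, experts, top_k=TOP_K):
    """Run only the selected experts and blend their outputs by router weight."""
    output = 0.0
    for index, weight in route_token(router_logits, top_k):
        output += weight * experts[index](token)
    return output


# Toy setup (hypothetical): each expert just scales its input by a fixed factor.
rng = random.Random(0)
experts = [(lambda x, scale=rng.uniform(-1.0, 1.0): scale * x)
           for _ in range(NUM_EXPERTS)]
logits = [rng.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]

chosen = route_token(logits)
print(f"experts run for this token: {len(chosen)} of {NUM_EXPERTS}")

# Share of parameters active per token, from the article's figures.
active_fraction = 9.8 / 229.9
print(f"active parameter share: {active_fraction:.1%}")  # about 4.3%
```

The other 248 experts are never evaluated for this token, which is where the savings come from: by the article's figures, roughly 4.3% of M3's parameters do work on any given token.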

Layered on top is a new sparse-attention mechanism the lab calls MSA, which attacks the other great cost of long-context models. Normally, letting a model "see" a million tokens at once is punishingly expensive, because every token must be compared against every other, so the cost of attention grows with the square of the length. MSA cuts the per-token compute at full one-million context to roughly one-twentieth of the previous generation, delivering a 9.7× faster prefill and a 15.6× faster decode. In plainer terms: it reads a very long document, and it does so without melting the data center.
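To see why capping what each token attends to matters at this scale, it helps to count comparisons. The sketch below is a back-of-the-envelope model, not a description of how MSA works, which the article does not detail. It assumes dense causal attention, where each token is scored against every earlier token, and a hypothetical sparse scheme with a fixed per-token budget. The budget of 50,000 is an assumption picked only so the last-token ratio matches the article's "one-twentieth" figure; the real figure covers total per-token compute, not attention scores alone.

```python
def dense_scores_for_token(position):
    """Query-key comparisons for the token at a 1-based position, dense causal attention."""
    return position


def sparse_scores_for_token(position, budget):
    """Comparisons when each token may attend to at most `budget` earlier tokens."""
    return min(position, budget)


def total_scores(n_tokens, budget=None):
    """Total comparisons over a whole sequence; budget=None means dense attention."""
    if budget is None:
        return n_tokens * (n_tokens + 1) // 2
    capped = min(budget, n_tokens)
    # The first `capped` tokens behave densely; every later token costs exactly `capped`.
    return capped * (capped + 1) // 2 + (n_tokens - capped) * capped


CONTEXT = 1_000_000  # context length from the article
BUDGET = 50_000      # hypothetical per-token budget, not a published MSA parameter

last_token_ratio = (dense_scores_for_token(CONTEXT)
                    / sparse_scores_for_token(CONTEXT, BUDGET))
print(f"last-token comparisons, dense vs. capped: {last_token_ratio:.0f}x")  # 20x

whole_sequence_ratio = total_scores(CONTEXT) / total_scores(CONTEXT, BUDGET)
print(f"whole-sequence comparisons, dense vs. capped: {whole_sequence_ratio:.1f}x")
```

The point of the exercise is the shape of the curves: the dense count grows with the square of the context length, while the capped count grows only linearly once the budget is reached.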

The frontier used to be defined by who could spend the most. M3 reframes the contest around who can waste the least—and efficiency, unlike raw spending, is something a smaller player can win.

MiniMax M3 is the first open-weight model to combine frontier coding, a million-token context, and native multimodality in a single architecture.

— MiniMax, M3 technical release, June 1, 2026

Why "Open" Changes the Stakes

To understand why this matters, separate two questions that often get blurred: how good is the best model, and who gets to use a very good model. For the past few years, progress on the first question has dominated headlines. M3 is a salvo in the second. When a frontier-adjacent model is freely downloadable, capability stops being a subscription and becomes infrastructure. A startup in Nairobi, a university lab in São Paulo, a hospital system that cannot send patient data to a third party—all of them can now run a near-frontier model on their own terms.

That redistribution has a geopolitical edge. The leading open-weight releases of the past two years have increasingly come from Chinese labs, while the top American labs have largely kept their best models closed. M3 sharpens a strategic question the United States has been slow to answer: if the world's developers build on whichever capable model is free to use, the lab that gives its weights away may end up shaping the ecosystem more than the lab that guards them.

An open model now sits among the closed leaders—still behind the very top, no longer out of reach.

What People Actually Build With This

The abstract case for open weights becomes concrete the moment you ask what they unlock. A one-million-token context window is not a spec-sheet flourish; it is the difference between a model that can glance at a few files and one that can ingest an entire codebase, a full legal contract set, or a year of an organization's documents and reason across all of it at once. For a developer, that means an assistant that understands a whole project rather than a snippet. For a researcher, it means feeding in a sprawling corpus and asking questions that span the whole of it.
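A quick way to check whether a given project would actually fit in such a window is to estimate its token count. The sketch below is a rough planning aid under stated assumptions: the four-characters-per-token ratio is a common rule of thumb rather than M3's tokenizer, and the function names, file suffixes, and output reserve are all hypothetical choices for illustration.

```python
import math
from pathlib import Path

CONTEXT_TOKENS = 1_000_000  # window size from the article
CHARS_PER_TOKEN = 4.0       # assumption: rough heuristic, real tokenizers vary by language and content


def estimate_tokens(text, chars_per_token=CHARS_PER_TOKEN):
    """Approximate token count of a string from its character length."""
    return math.ceil(len(text) / chars_per_token)


def estimate_tree_tokens(root, suffixes=(".py", ".md")):
    """Sum estimated tokens over all files under `root` with the given suffixes."""
    total = 0
    for path in Path(root).rglob("*"):
        if path.is_file() and path.suffix in suffixes:
            text = path.read_text(encoding="utf-8", errors="ignore")
            total += estimate_tokens(text)
    return total


def fits_in_context(token_count, context_tokens=CONTEXT_TOKENS,
                    reserve_for_output=16_000):
    """True if the input plus a reserve for the model's reply fits in the window."""
    return token_count + reserve_for_output <= context_tokens
```

Calling `fits_in_context(estimate_tree_tokens("path/to/repo"))` gives a first-pass answer. For a real deployment, the estimate should be replaced with a count from the model's own tokenizer.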

Because M3 is multimodal, that same window can hold images and video, not just text—a model that reads the screenshots in a bug report, the diagrams in a paper, the frames of a recording. And because it is open, none of this requires sending sensitive material to someone else's server. A hospital bound by privacy law, a bank with proprietary code, a government office that cannot ship its data abroad—each can now run a near-frontier system entirely inside its own walls. That single property, data sovereignty, is often worth more to an institution than a few points on a benchmark.

The flip side is the reason some researchers are uneasy. Open weights cannot be recalled. Once a capable model is downloaded by hundreds of thousands of people, no safety patch, no policy, and no change of heart can claw it back. The guardrails that closed labs apply at the API—refusing dangerous requests, monitoring for misuse—can be stripped from an open model by anyone willing to fine-tune it. The same openness that democratizes capability also distributes it to actors who will not play by anyone's rules. This is the unresolved tension at the center of the open-model movement, and M3, by pushing the open frontier forward, pushes that tension forward too.

The Caveats Worth Keeping

Skepticism is warranted, and the field has learned to apply it. Benchmark figures at launch are, almost universally, vendor-reported and not yet independently audited; some of M3's headline numbers remain unverified by third parties. Benchmarks also measure narrow slices of capability and can be gamed by training on data that resembles the test. The true test of a model is not its launch-day scorecard but how it performs in the messy hands of millions of developers over the following months. Open weights, helpfully, make that scrutiny possible in a way closed APIs never can—anyone can probe M3's failure modes directly.

The question is no longer whether an open model can reach the frontier. It is how long the frontier stays a place only the giants can afford to live.

— On the new shape of the AI race

What It Signals

Step back and M3 looks less like a single product and more like a marker on a curve. The capabilities that defined the absolute cutting edge eighteen months ago are now available, for free, to anyone with a capable server. That compression—frontier today, commodity tomorrow—is becoming the defining rhythm of the field, and it is accelerating. The architectural ideas that make M3 cheap to run, sparse experts and sparse attention, point toward a future where capability is limited less by how many chips you can buy and more by how cleverly you use the ones you have.

None of this means the closed labs are in trouble; they remain ahead at the very top, and they are not standing still. But the terms of the contest are shifting. For the researchers, founders, and tinkerers who were locked out of the frontier, the door just opened a little. The interesting question now is what they build once they walk through it.

Sources

MiniMax. "MiniMax M3 — Coding & Agentic Frontier, 1M Context, Multimodal." minimax.io
DataNorth AI. "MiniMax M3: Open-Weight Frontier Model with 1M Context." datanorth.ai
NYU Shanghai RITS. "MiniMax M3: Frontier Coding, 1M Context, and Sparse Attention." rits.shanghai.nyu.edu
TechTimes. "MiniMax M3 Open-Weight Coding Model: Frontier Claims, Unverified Benchmarks." techtimes.com
Codersera. "MiniMax M3: Developer Guide to the Open-Weight 1M-Context Frontier." codersera.com
Nerd Level Tech. "MiniMax M3: Open-Weight Coding at 1/10 the Cost." nerdleveltech.com
Lushbinary. "MiniMax M3 Developer Guide: Benchmarks & Pricing." lushbinary.com
VentureBeat. "Meta launches new proprietary AI model Muse Spark." venturebeat.com
LLM-Stats. "AI Updates Today (June 2026) — Latest AI Model Releases." llm-stats.com
Crescendo AI. "Latest AI News and Breakthroughs That Matter Most — June 2026." crescendo.ai
