What is GeneBench-Pro and what does it test?

GeneBench-Pro is a genomics benchmark published by OpenAI on June 30, 2026, consisting of 129 synthetic research-grade problems across 21 sub-domains including statistical genetics, population genomics, and clinical pharmacogenomics. It tests whether AI models can perform judgment-heavy biological analysis on deliberately messy datasets, not just pattern-match against known answers.

What did the best AI model score on GeneBench-Pro?

GPT-5.6 Sol Pro, OpenAI's top model at its maximum reasoning setting, passed only 31.5% of GeneBench-Pro problems. The best non-OpenAI model, Claude Opus 4.8, scored 16.0%, while a human expert would be expected to solve the problems correctly with roughly 100% accuracy given enough time.

Why do AI models fail on hard scientific reasoning benchmarks?

Audits of failed GeneBench-Pro responses found that models typically applied statistically plausible methods without verifying their underlying assumptions, under-weighted evidence of confounding when a simpler wrong explanation was available, and struggled most when the correct answer depended on reasoning about how data was collected. Models performed best in domains with standardized analytical pipelines and worst where expert judgment about the data-generating process was central.

How is GeneBench-Pro different from other AI science benchmarks?

Every problem is generated synthetically from a fully known ground-truth causal structure, so grading is entirely deterministic with no subjective rubric. The datasets are also deliberately contaminated with measurement error, selection bias, confounding variables, and quality-control failures that mirror real-world research challenges, closing the common loopholes of training-data leakage and inconsistent human grading.

What tasks can AI models reliably help with in biology research right now?

Current frontier models perform well on pattern-recognition-heavy tasks such as flagging candidate variants, summarizing scientific literature, and drafting analysis code. They remain unreliable for tasks requiring independent judgment about which of several plausible causal explanations is correct in the presence of ambiguous evidence, where human oversight is still essential.

AGI & Scientific Discovery

The Judgment Gap

OpenAI built a genomics exam no model had seen before — 129 research-grade problems, each one the kind of messy, judgment-heavy analysis a real biologist spends weeks untangling. The best AI in the world passed less than a third of it.

July 5, 2026 · By Lisa Pedrosa · 11 min read · AGI · Genomics

Ask a frontier AI model to summarize a genomics paper, and it will do so fluently, often better than a tired postdoc at 11 p.m. Ask it to actually be the postdoc — to take a noisy, real-world dataset full of measurement error and confounding, choose the right statistical model among several plausible candidates, and defend a specific scientific conclusion — and something changes. On June 30, 2026, OpenAI published a benchmark built to find exactly that seam. It is called GeneBench-Pro, and it just gave the AI-for-science hype cycle its most uncomfortable data point yet.

The benchmark presents 129 synthetic problems spanning ten domains and twenty-one sub-domains of genomics, quantitative biology, and translational medicine — statistical genetics, population genomics, regulatory omics, clinical pharmacogenomics, cancer somatic genomics, forensic genetics, and more. Each problem pairs a realistic, deliberately messy dataset with a specific downstream decision: is this variant pathogenic, does this drug-response signal replicate, is this forensic match reliable. Reviewers estimated that a human expert would need 20 to 40 hours to solve a single problem properly. OpenAI's best model, GPT-5.6 Sol Pro, solved 31.5% of them at its maximum reasoning setting.

31.5%: Pass rate of the top model, GPT-5.6 Sol Pro

16.0%: Pass rate of the best non-OpenAI model, Claude Opus 4.8

129: Research-grade problems across 21 sub-domains

20–40 hrs: Estimated time for a human expert to solve one problem

Why This Benchmark Is Different

Most AI benchmarks that claim to test "scientific reasoning" quietly cheat in one of two ways: they ask questions with answers already in the training data, or they rely on human graders whose rubrics vary enough to make small performance differences meaningless. GeneBench-Pro was built to close both loopholes. Every problem is generated synthetically from a fully known, ground-truth causal structure — meaning the designers know the correct answer with certainty, because they built the underlying biology themselves, in software, before disguising it as a real dataset. That lets grading be entirely deterministic. A model either recovers the true structure or it doesn't; there's no rubric to argue with.
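The generate-then-grade idea can be sketched in a few lines of Python. This is a toy illustration, not OpenAI's generator: the function names (`make_problem`, `naive_solver`, `grade`), the single-causal-variant setup, and the effect size are all assumptions made for the example.

```python
import random

def make_problem(seed: int, n_variants: int = 20, n_samples: int = 500):
    """Build a toy dataset from a causal structure we choose ourselves (hypothetical setup)."""
    rng = random.Random(seed)
    causal = rng.randrange(n_variants)   # ground truth, known only to the benchmark designer
    effect = 2.0                         # assumed effect size, chosen for illustration
    rows = []
    for _ in range(n_samples):
        genotypes = [rng.randint(0, 2) for _ in range(n_variants)]
        phenotype = effect * genotypes[causal] + rng.gauss(0.0, 1.0)
        rows.append((genotypes, phenotype))
    return rows, causal

def naive_solver(rows):
    """Stand-in for a model: pick the variant whose genotype covaries most with the phenotype."""
    n = len(rows)
    n_variants = len(rows[0][0])
    mean_y = sum(y for _, y in rows) / n
    best, best_score = 0, -1.0
    for j in range(n_variants):
        mean_g = sum(g[j] for g, _ in rows) / n
        cov = sum((g[j] - mean_g) * (y - mean_y) for g, y in rows) / n
        if abs(cov) > best_score:
            best, best_score = j, abs(cov)
    return best

def grade(answer: int, truth: int) -> bool:
    """Deterministic grading: the answer matches the built-in truth or it does not."""
    return answer == truth

rows, truth = make_problem(seed=7)
print(grade(naive_solver(rows), truth))
```

Because the truth is fixed by construction, `grade` is a plain equality check with nothing for a human rubric to interpret. The clean toy data here is far easier than anything the article describes; the benchmark's difficulty comes from the contamination layered on top.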

The problems are also deliberately contaminated with the specific failure modes that make real research hard: measurement error that isn't clearly labeled as such, selection bias baked into how the synthetic "samples" were gathered, confounding variables that correlate with the outcome for reasons that have nothing to do with the true biology, and quality-control failures that a careless analyst would miss entirely. A model has to notice these problems exist before it can even begin to correct for them — the exact skill that separates a first-year graduate student from a principal investigator.
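Confounding of this kind is easy to reproduce in miniature. The sketch below is an assumed toy setup, not a GeneBench-Pro problem: a hidden group variable (think ancestry or batch) raises both the chance of carrying a variant and the phenotype, while the variant itself has no effect at all. The names `simulate`, `carrier_gap`, and `stratified_gap` are hypothetical.

```python
import random
from statistics import mean

def simulate(seed: int = 0, n: int = 20000):
    """Toy data in which a hidden group drives both carrier status and phenotype."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        group = rng.random() < 0.5                         # hidden confounder
        carrier = rng.random() < (0.7 if group else 0.2)   # group changes carrier frequency
        # The variant has NO true effect; only the group shifts the phenotype.
        phenotype = (1.0 if group else 0.0) + rng.gauss(0.0, 1.0)
        rows.append((group, carrier, phenotype))
    return rows

def carrier_gap(rows):
    """Naive estimate: mean phenotype of carriers minus mean phenotype of non-carriers."""
    carriers = [y for _, c, y in rows if c]
    others = [y for _, c, y in rows if not c]
    return mean(carriers) - mean(others)

def stratified_gap(rows):
    """Adjusted estimate: compute the gap within each group, then average the two."""
    gaps = []
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        gaps.append(carrier_gap(stratum))
    return mean(gaps)

data = simulate()
print(f"naive gap:      {carrier_gap(data):.3f}")
print(f"stratified gap: {stratified_gap(data):.3f}")
```

The naive comparison reports a sizable carrier effect that is entirely produced by the group variable, while the stratified estimate sits near zero. An analyst can only make that correction after noticing that the group variable exists and matters, which is the step the paragraph above describes.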

GeneBench-Pro tests whether models can handle the kind of judgment-heavy analysis that real-world computational biology requires — not pattern-matching against known answers, but reasoning through ambiguity toward a defensible conclusion.

— Framing of GeneBench-Pro's design goals, OpenAI

A Sobering Counterpoint to the Co-Scientist Narrative

The timing matters. Just five days before GeneBench-Pro's release, an AI system reasoning alongside researchers had helped crack both a forty-year-old mathematics problem and a three-year-old biology mystery in a single week — the kind of story that fuels headlines about AI "co-scientists" closing in on autonomous discovery. Google DeepMind's AlphaProof Nexus, meanwhile, has spent the spring resolving open Erdős problems and a 15-year-old question in algebraic geometry, for a few hundred dollars a proof.

GeneBench-Pro is the other half of the picture. On ARC-AGI-3, a benchmark of handcrafted reasoning puzzles designed to have no memorizable answer, every frontier model scores below 1% — Google's Gemini 3.1 Pro leads at just 0.37%, ahead of GPT-5.4 at 0.26% and Claude Opus 4.6 at 0.25%, against untrained humans who solve the same puzzles with effectively perfect accuracy. Nature has separately reported that humans still outperform AI on a newly designed, highly rigorous mathematics test built specifically to avoid contamination from training data. Put together, these results describe a consistent pattern: today's frontier models are extraordinary at recombining and extending patterns they've seen before, and still fragile the moment a problem requires genuinely novel judgment under uncertainty — precisely the condition every unsolved scientific question puts in front of a researcher.

This is not an argument that AI is useless for science — AlphaFold reshaped structural biology, and AI-assisted teams are genuinely resolving open conjectures. It's an argument that "AI co-scientist" and "autonomous AI scientist" describe two very different capabilities, and 2026's headlines have been conflating them.

What "20% Better Than Last Time" Actually Means

OpenAI's earlier genomics benchmark, before this "Pro" revision, was already showing meaningful year-over-year gains — part of why the company built a harder version rather than resting on the old one. GeneBench-Pro's jump in difficulty means the roughly 70% of problems still out of reach for the best model isn't evidence that progress has stalled. It's evidence that the goalposts were deliberately moved to somewhere models hadn't yet reached, and the field found the new edge almost immediately. That is, in its own way, reassuring: it means researchers are still able to design tests hard enough to be informative, rather than watching every benchmark saturate to 95% within a year of release, which had started to become a real methodological problem across the industry.

This is a live illustration of Goodhart's Law, the observation from economics that once a measure becomes a target it stops being a reliable measure, playing out in real time across the AI industry. Older reasoning benchmarks, once genuinely difficult, are now solved well enough that they no longer distinguish frontier models from each other. GeneBench-Pro's synthetic, ground-truth generation process is partly a defense against this decay: because the underlying causal structure can be endlessly regenerated with new random seeds, the benchmark can be refreshed each time models start to game it, rather than requiring an entirely new benchmark from scratch every eighteen months.
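A minimal sketch of why seeded regeneration blunts memorization, under assumed details: the `hidden_answer` and `refresh` helpers and the seed scheme are hypothetical, and a real refresh would regenerate whole datasets rather than a single integer per problem.

```python
import random

def hidden_answer(seed: int, n_candidates: int = 1000) -> int:
    """Toy stand-in for a problem's ground truth, fully determined by its seed."""
    return random.Random(seed).randrange(n_candidates)

def refresh(version: int, n_problems: int = 129):
    """Regenerate the whole answer key from a new block of seeds (assumed scheme)."""
    return [hidden_answer(seed=version * 10_000 + i) for i in range(n_problems)]

old_key = refresh(version=1)
new_key = refresh(version=2)

# Fraction of problems where a memorized old answer would still be correct.
overlap = sum(a == b for a, b in zip(old_key, new_key)) / len(old_key)
print(f"memorized answers still correct after refresh: {overlap:.1%}")
```

Each version is exactly reproducible from its seeds, so grading stays deterministic, yet an answer key leaked from one version is close to useless against the next.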

The Practical Stakes for Labs Right Now

For working biologists, the message of GeneBench-Pro is less about whether AI is "ready" in the abstract and more about where the line currently sits. Pattern-recognition-heavy tasks such as flagging candidate variants, summarizing literature, and drafting analysis code are squarely inside what today's models do well. Tasks requiring a model to independently decide which of several plausible causal stories is correct, in the presence of deliberately ambiguous evidence, remain a place where human oversight isn't a formality. That is the difference between a very capable research assistant and an independent investigator, and GeneBench-Pro is the first benchmark that puts a specific number on the gap: roughly 70 percentage points as of June 2026, the distance between the top model's 31.5% pass rate and the near-100% accuracy expected of a human expert given enough time.

NewtonBench, introduced at ICLR 2026, found a similar pattern across 324 tasks in twelve physics domains: models could apply known laws fluently but struggled to rediscover them from raw experimental interaction — the exact judgment-under-uncertainty skill GeneBench-Pro is now measuring in biology.

— Cross-referenced findings, ICLR 2026 NewtonBench study

Where the Failures Actually Happen

The most useful part of GeneBench-Pro isn't the headline pass rate; it's the failure taxonomy behind it. Reviewers who audited incorrect model responses found that failures clustered in a small number of recurring patterns. Models frequently selected a statistically plausible method without checking whether its underlying assumptions actually held for the dataset in front of them. They tended to under-weight evidence of confounding when a simpler, wrong explanation was available. And they struggled most in problems from population genomics and forensic genetics, where the "right" answer depends on carefully reasoning about how a sample was collected rather than on any calculation once it's in hand. Notably, models performed best in regulatory and functional genomics, domains with more standardized analytical pipelines and less ambiguity about which method applies.
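The sample-collection failure can be illustrated with a toy selection effect. In this assumed setup, which is not drawn from the benchmark, two traits are independent in the population, but individuals are enrolled only when the sum of the traits crosses a threshold. The `pearson` helper and the threshold of 1.0 are choices made for the example.

```python
import random
from math import sqrt

def pearson(xs, ys):
    """Plain Pearson correlation, written out to keep the sketch dependency-free."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

rng = random.Random(1)

# Two traits that are independent by construction in the full population.
population = [(rng.gauss(0.0, 1.0), rng.gauss(0.0, 1.0)) for _ in range(20000)]

# Assumed enrollment rule: only individuals whose traits sum above a threshold are sampled.
enrolled = [(a, b) for a, b in population if a + b > 1.0]

r_population = pearson(*zip(*population))
r_enrolled = pearson(*zip(*enrolled))
print(f"correlation in population:      {r_population:+.2f}")
print(f"correlation in enrolled sample: {r_enrolled:+.2f}")
```

The correlation computed on the enrolled sample is arithmetically correct and still misleading: the negative association comes from the enrollment rule, not from any link between the traits. Catching it requires asking how the sample was collected before trusting the calculation.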

That pattern lines up with something biologists have suspected but rarely had numbers for: today's models are strongest exactly where the analytical playbook is well established, and weakest exactly where expert judgment about the data-generating process is the whole point of the exercise. It's a more precise diagnosis than "AI isn't ready for science" — it's closer to "AI is ready for the parts of science that already have a checklist, and not yet ready for the parts that require deciding whether the checklist even applies."

Why This Matters

Every few months brings a new headline about AI "doing science": cracking a conjecture, designing a protein, proposing a drug candidate. Those stories are true, and they matter. But GeneBench-Pro is a useful corrective precisely because it was designed by the same industry producing those headlines, not by skeptics trying to deflate them. When OpenAI's own hardest genomics test shows its own best model failing more than two-thirds of research-grade problems, that's a more credible signal about the actual state of AI-for-science than any single story, success or failure. The honest picture for mid-2026 is a genuine capability gain running in parallel with a genuine, well-measured judgment gap, and knowing exactly where that gap sits is what will let biologists deploy these tools where they help and keep a human in the loop where they still, provably, don't.

For graduate programs training the next generation of computational biologists, that distinction is becoming part of the curriculum itself. Several genomics departments have begun explicitly teaching students to identify the GeneBench-Pro failure categories — unchecked model assumptions, under-weighted confounding, sampling bias — as a checklist for reviewing both their own analyses and any AI-assisted output they lean on. The skill the benchmark measures in machines, in other words, is quietly becoming the skill graduate programs measure in people: not can you run the analysis, but can you tell when the analysis is quietly wrong.

Sources

Introducing GeneBench-Pro — OpenAI
OpenAI Genomics Benchmark: AI Judgment Gap Exposed in Research-Grade Tasks —
