AGI & Scientific Discovery

The Judgment Gap

OpenAI built a genomics exam no model had seen before — 129 research-grade problems, each one the kind of messy, judgment-heavy analysis a real biologist spends weeks untangling. The best AI in the world passed less than a third of it.

July 5, 2026 By Lisa Pedrosa 11 min read AGI · Genomics

Ask a frontier AI model to summarize a genomics paper, and it will do so fluently, often better than a tired postdoc at 11 p.m. Ask it to actually be the postdoc — to take a noisy, real-world dataset full of measurement error and confounding, choose the right statistical model among several plausible candidates, and defend a specific scientific conclusion — and something changes. On June 30, 2026, OpenAI published a benchmark built to find exactly that seam. It is called GeneBench-Pro, and it just gave the AI-for-science hype cycle its most uncomfortable data point yet.

The benchmark presents 129 synthetic problems spanning ten domains and twenty-one sub-domains of genomics, quantitative biology, and translational medicine — statistical genetics, population genomics, regulatory omics, clinical pharmacogenomics, cancer somatic genomics, forensic genetics, and more. Each problem pairs a realistic, deliberately messy dataset with a specific downstream decision: is this variant pathogenic, does this drug-response signal replicate, is this forensic match reliable. Reviewers estimated that a human expert would need 20 to 40 hours to solve a single problem properly. OpenAI's best model, GPT-5.6 Sol Pro, solved 31.5% of them at its maximum reasoning setting.

31.5%: pass rate of the top model, GPT-5.6 Sol Pro
16.0%: pass rate of the best non-OpenAI model, Claude Opus 4.8
129: research-grade problems across 21 sub-domains
20–40 hrs: estimated time for a human expert to solve one problem

Why This Benchmark Is Different

Most AI benchmarks that claim to test "scientific reasoning" quietly cheat in one of two ways: they ask questions with answers already in the training data, or they rely on human graders whose rubrics vary enough to make small performance differences meaningless. GeneBench-Pro was built to close both loopholes. Every problem is generated synthetically from a fully known, ground-truth causal structure — meaning the designers know the correct answer with certainty, because they built the underlying biology themselves, in software, before disguising it as a real dataset. That lets grading be entirely deterministic. A model either recovers the true structure or it doesn't; there's no rubric to argue with.
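To make the idea concrete, here is a minimal sketch of ground-truth generation and deterministic grading. It is a toy, not OpenAI's generator: the function names (`make_problem`, `grade`, `naive_solver`), the three-variant setup, and the effect size are all assumptions chosen for illustration.

```python
import random

VARIANTS = ("v1", "v2", "v3")  # hypothetical candidate variants


def make_problem(seed: int, n: int = 2000) -> dict:
    """Build a toy dataset from a causal structure the designer fully controls."""
    rng = random.Random(seed)
    causal = rng.choice(VARIANTS)  # the hidden ground truth
    rows = []
    for _ in range(n):
        genotype = {v: rng.randint(0, 2) for v in VARIANTS}  # allele counts 0, 1, 2
        # Only the causal variant shifts the trait; the rest is noise.
        trait = 0.8 * genotype[causal] + rng.gauss(0, 1)
        rows.append({**genotype, "trait": trait})
    return {"data": rows, "answer": causal}


def grade(problem: dict, submitted: str) -> bool:
    """Deterministic grading: compare against the known answer, no rubric."""
    return submitted == problem["answer"]


def naive_solver(rows: list) -> str:
    """Pick the variant whose allele count covaries most strongly with the trait."""
    def abs_cov(v: str) -> float:
        mean_g = sum(r[v] for r in rows) / len(rows)
        mean_t = sum(r["trait"] for r in rows) / len(rows)
        return abs(sum((r[v] - mean_g) * (r["trait"] - mean_t) for r in rows))
    return max(VARIANTS, key=abs_cov)


problem = make_problem(seed=7)
print(grade(problem, naive_solver(problem["data"])))
```

In this clean toy, a naive solver is enough. The benchmark's difficulty, as described here, comes from what gets layered on top of the known structure before a model ever sees the data.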

The problems are also deliberately contaminated with the specific failure modes that make real research hard: measurement error that isn't clearly labeled as such, selection bias baked into how the synthetic "samples" were gathered, confounding variables that correlate with the outcome for reasons that have nothing to do with the true biology, and quality-control failures that a careless analyst would miss entirely. A model has to notice these problems exist before it can even begin to correct for them — the exact skill that separates a first-year graduate student from a principal investigator.
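Confounding is the easiest of these traps to show in a few lines. The sketch below is an illustration under assumed numbers, not a GeneBench-Pro problem: a hidden factor `z` (think batch or ancestry group) drives both the exposure and the outcome, while the exposure itself has no effect at all.

```python
import random
from statistics import mean


def simulate(seed: int = 0, n: int = 20000) -> list:
    """Rows of (z, x, y) where x has NO causal effect on y."""
    rng = random.Random(seed)
    rows = []
    for _ in range(n):
        z = rng.random() < 0.5                    # hidden confounder
        x = rng.random() < (0.8 if z else 0.2)    # exposure is more common when z is true
        y = (1.0 if z else 0.0) + rng.gauss(0, 1)  # outcome depends on z only
        rows.append((z, x, y))
    return rows


def diff_in_means(rows: list) -> float:
    """Mean outcome of exposed minus mean outcome of unexposed."""
    exposed = [y for _, x, y in rows if x]
    unexposed = [y for _, x, y in rows if not x]
    return mean(exposed) - mean(unexposed)


def stratified_diff(rows: list) -> float:
    """Compare exposed and unexposed within each level of z, then average by stratum size."""
    estimate = 0.0
    for level in (False, True):
        stratum = [r for r in rows if r[0] == level]
        estimate += len(stratum) / len(rows) * diff_in_means(stratum)
    return estimate


rows = simulate()
print("naive estimate:   ", round(diff_in_means(rows), 2))
print("adjusted estimate:", round(stratified_diff(rows), 2))
```

With these assumed probabilities the naive comparison lands near 0.6 in expectation, an effect that does not exist, and the stratified estimate lands near zero. The adjustment only works because the analyst knew to look for `z`, which is the noticing step the benchmark is built to test.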

GeneBench-Pro tests whether models can handle the kind of judgment-heavy analysis that real-world computational biology requires — not pattern-matching against known answers, but reasoning through ambiguity toward a defensible conclusion.
— Framing of GeneBench-Pro's design goals, OpenAI

A Sobering Counterpoint to the Co-Scientist Narrative

The timing matters. Just five days before GeneBench-Pro's release, an AI system reasoning alongside researchers had helped crack both a forty-year-old mathematics problem and a three-year-old biology mystery in a single week — the kind of story that fuels headlines about AI "co-scientists" closing in on autonomous discovery. Google DeepMind's AlphaProof Nexus, meanwhile, has spent the spring resolving open Erdős problems and a 15-year-old question in algebraic geometry, for a few hundred dollars a proof.

GeneBench-Pro is the other half of the picture, and it does not stand alone. On ARC-AGI-3, a benchmark of handcrafted reasoning puzzles designed to have no memorizable answer, every frontier model scores below 1%: Google's Gemini 3.1 Pro leads at just 0.37%, ahead of GPT-5.4 at 0.26% and Claude Opus 4.6 at 0.25%, while untrained humans solve the same puzzles with effectively perfect accuracy. Nature has separately reported that humans still outperform AI on a newly designed, highly rigorous mathematics test built specifically to avoid contamination from training data. Put together, these results describe a consistent pattern: today's frontier models are extraordinary at recombining and extending patterns they've seen before, and still fragile the moment a problem requires genuinely novel judgment under uncertainty, which is precisely the condition every unsolved scientific question puts in front of a researcher.

This is not an argument that AI is useless for science — AlphaFold reshaped structural biology, and AI-assisted teams are genuinely resolving open conjectures. It's an argument that "AI co-scientist" and "autonomous AI scientist" describe two very different capabilities, and 2026's headlines have been conflating them.
[Chart: GeneBench-Pro pass rates. Claude Opus 4.8: 16.0%; GPT-5.6 Sol: 28.7%; GPT-5.6 Sol Pro: 31.5%; human expert: ~100%.]

What "20% Better Than Last Time" Actually Means

OpenAI's earlier genomics benchmark, before this "Pro" revision, was already showing meaningful year-over-year gains — part of why the company built a harder version rather than resting on the old one. GeneBench-Pro's jump in difficulty means the roughly 70% of problems still out of reach for the best model isn't evidence that progress has stalled. It's evidence that the goalposts were deliberately moved to somewhere models hadn't yet reached, and the field found the new edge almost immediately. That is, in its own way, reassuring: it means researchers are still able to design tests hard enough to be informative, rather than watching every benchmark saturate to 95% within a year of release, which had started to become a real methodological problem across the industry.

This is a live illustration of Goodhart's Law, the economist's observation that once a measure becomes a target it stops being a reliable measure, playing out in real time across the AI industry. Older reasoning benchmarks, once genuinely difficult, are now solved well enough that they no longer distinguish frontier models from each other. GeneBench-Pro's synthetic, ground-truth generation process is partly a defense against this decay: because fresh problems can be generated from the same underlying causal machinery with new random seeds, the benchmark can be refreshed each time models start to game it, rather than requiring an entirely new benchmark from scratch every eighteen months.
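The refresh mechanism can be sketched in a few lines. This is a hypothetical layout, not the published pipeline: `generate_item`, `build_release`, the domain list, and the item fields are invented for illustration. The point is only that a release number seeds every item, so any release is exactly reproducible while a new one shares no answers with the old.

```python
import random

# Hypothetical subset of domains, taken from the article's list.
DOMAINS = ["statistical genetics", "population genomics", "forensic genetics"]


def generate_item(seed: int) -> dict:
    """One synthetic problem spec; the hidden truth is fixed entirely by the seed."""
    rng = random.Random(seed)
    return {
        "domain": rng.choice(DOMAINS),
        "true_effect": round(rng.uniform(-1.0, 1.0), 3),  # ground truth kept by the grader
        "confounded": rng.random() < 0.5,                # whether a confounder is injected
    }


def build_release(version: int, size: int = 5) -> list:
    """Derive per-item seeds from the release number so a release can be rebuilt exactly."""
    master = random.Random(version)
    return [generate_item(master.getrandbits(32)) for _ in range(size)]


old = build_release(version=1)
new = build_release(version=2)
print(old == build_release(version=1))  # same version, same problems
print(old == new)                       # new version, new problems
```

A model that memorized the answers to release 1 gains nothing on release 2, because the generator is shared but the hidden values are not.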

The Practical Stakes for Labs Right Now

For working biologists, the message of GeneBench-Pro is less about whether AI is "ready" in the abstract and more about where the line currently sits. Pattern-recognition-heavy tasks — flagging candidate variants, summarizing literature, drafting analysis code — are squarely inside what today's models do well. Tasks requiring a model to independently decide which of several plausible causal stories is correct, in the presence of deliberately ambiguous evidence, remain a place where human oversight isn't a formality. It's the difference between a very capable research assistant and an independent investigator, and GeneBench-Pro is the first benchmark that puts a specific number on that gap: roughly 70 percentage points, as of June 2026.
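The arithmetic behind that figure is simple enough to check. The sketch below uses only numbers quoted in this article; treating the human baseline as exactly 100% is an assumption, since the reported figure is approximate.

```python
PROBLEMS = 129
HUMAN_BASELINE = 100.0  # assumption: the article reports roughly 100% for human experts

PASS_RATES = {
    "GPT-5.6 Sol Pro": 31.5,
    "GPT-5.6 Sol": 28.7,
    "Claude Opus 4.8": 16.0,
}


def summarize(rate_percent: float) -> dict:
    """Convert a pass rate into an equivalent problem count and a gap to the baseline."""
    return {
        "problems_solved": PROBLEMS * rate_percent / 100,
        "gap_points": HUMAN_BASELINE - rate_percent,
    }


for model, rate in PASS_RATES.items():
    s = summarize(rate)
    print(f"{model}: about {s['problems_solved']:.1f} of {PROBLEMS} problems, "
          f"gap of {s['gap_points']:.1f} points")
```

For the top model this works out to about 40.6 problems and a gap of 68.5 points, which is where "roughly 70" comes from. The fractional problem count means the 31.5% cannot be a single run's tally of whole problems; how it was aggregated is not stated here.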

NewtonBench, introduced at ICLR 2026, found a similar pattern across 324 tasks in twelve physics domains: models could apply known laws fluently but struggled to rediscover them from raw experimental interaction — the exact judgment-under-uncertainty skill GeneBench-Pro is now measuring in biology.
— Cross-referenced findings, ICLR 2026 NewtonBench study

Where the Failures Actually Happen

The most useful part of GeneBench-Pro isn't the headline pass rate; it's the failure taxonomy behind it. Reviewers who audited incorrect model responses found that failures clustered in a small number of recurring patterns. Models frequently selected a statistically plausible method without checking whether its underlying assumptions actually held for the dataset in front of them; they tended to under-weight evidence of confounding when a simpler, wrong explanation was available; and they struggled most in problems from population genomics and forensic genetics, where the "right" answer depends on carefully reasoning about how a sample was collected rather than on any calculation performed once it's in hand. Notably, models performed best in regulatory and functional genomics, domains with more standardized analytical pipelines and less ambiguity about which method applies.
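Why the way a sample was collected can outweigh any downstream calculation is easy to demonstrate. In this illustrative sketch (not drawn from the benchmark; the threshold and sample size are assumptions), two traits are independent in the full population, but keeping only individuals whose combined score clears a cutoff manufactures a strong negative correlation between them.

```python
import random
from math import sqrt


def pearson(xs, ys) -> float:
    """Plain Pearson correlation coefficient."""
    n = len(xs)
    mx, my = sum(xs) / n, sum(ys) / n
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / sqrt(var_x * var_y)


def simulate_selection(seed: int = 0, n: int = 20000, cutoff: float = 1.0):
    """Return (correlation in the full population, correlation in the selected sample)."""
    rng = random.Random(seed)
    population = [(rng.gauss(0, 1), rng.gauss(0, 1)) for _ in range(n)]  # independent traits
    # Selection rule: only individuals with a high combined score enter the study.
    sampled = [(a, b) for a, b in population if a + b > cutoff]
    return pearson(*zip(*population)), pearson(*zip(*sampled))


full_r, sampled_r = simulate_selection()
print("full population r:", round(full_r, 2))
print("selected sample r:", round(sampled_r, 2))
```

The correlation in the selected sample is computed correctly and is still wrong as a statement about the population. No amount of care after the data is in hand repairs that; only reasoning about the collection step does.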

That pattern lines up with something biologists have suspected but rarely had numbers for: today's models are strongest exactly where the analytical playbook is well established, and weakest exactly where expert judgment about the data-generating process is the whole point of the exercise. It's a more precise diagnosis than "AI isn't ready for science" — it's closer to "AI is ready for the parts of science that already have a checklist, and not yet ready for the parts that require deciding whether the checklist even applies."

Why This Matters

Every few months brings a new headline about AI "doing science": cracking a conjecture, designing a protein, proposing a drug candidate. Those stories are true, and they matter. But GeneBench-Pro is a useful corrective precisely because it was designed by the same industry producing those headlines, not by skeptics trying to deflate them. When OpenAI's own hardest genomics test shows its own best model failing more than two-thirds of research-grade problems, that's a more credible signal about the actual state of AI-for-science than any single anecdote, whether a success story or a high-profile failure. The honest picture for mid-2026 is a genuine capability gain running in parallel with a genuine, well-measured judgment gap, and knowing exactly where that gap sits is what will let biologists deploy these tools where they help and keep a human in the loop where they still, demonstrably, don't.

For graduate programs training the next generation of computational biologists, that distinction is becoming part of the curriculum itself. Several genomics departments have begun explicitly teaching students to identify the GeneBench-Pro failure categories — unchecked model assumptions, under-weighted confounding, sampling bias — as a checklist for reviewing both their own analyses and any AI-assisted output they lean on. The skill the benchmark measures in machines, in other words, is quietly becoming the skill graduate programs measure in people: not can you run the analysis, but can you tell when the analysis is quietly wrong.

Sources

  1. Introducing GeneBench-Pro — OpenAI
  2. OpenAI Genomics Benchmark: AI Judgment Gap Exposed in Research-Grade Tasks —