AI & Science · Field Notes

Ground Truth: Why AI Scientists Keep Tripping Over Broken Databases, Not Broken Reasoning

Anthropic sent AI agents to fetch viral sequence data from a public database virologists have used for years. The models weren't the problem. The plumbing was, and fixing it lifted every system tested above 90 percent accuracy.

July 3, 2026 · Lisa Pedrosa · ~10 min read · AI & Science · Biology

On May 14, 2026, lab technicians at INRB Kinshasa in the Democratic Republic of Congo ran diagnostic tests on thirteen blood samples. Eight came back positive for Bundibugyo virus, a cousin of Ebola rare enough that most virologists will never see a live case in their careers. A new outbreak was declared the next day. By May 29, the World Health Organization had logged more than a thousand confirmed and suspected cases and over two hundred deaths. Researchers sequenced the first outbreak genomes within days — a genuine feat of modern genomics. Then came the part nobody warns you about in the highlight reel: to figure out whether this was a new spillover, whether existing diagnostics could still detect it, and whether the antibody therapies stockpiled for exactly this scenario would still work, someone had to go compare the new genomes against historical Ebola virus records sitting in a public database. And that first step — the one before any of the actual science — still meant a human being manually clicking through a web interface, one filter at a time, hoping the resulting dataset was complete.

This is the story Anthropic told in June 2026, and it is a stranger and more important one than the headline "AI helps with biology" suggests. The finding wasn't that artificial intelligence is bad at science. It's that several of the world's leading AI models, tested rigorously, kept failing at something that sounds almost embarrassingly basic, reliably fetching the right data from a public database, and that the failure had nothing to do with how smart the models were.

A test built to be boring, and that's the point

The research appeared in two parts: a post titled "Paving the way for agents in biology," written by Laura Luebbert of the Broad Institute and FutureHouse, and a technical preprint from a team that included Ferdous Nasri, Sarah Gurev, and NCBI's own Nuala A. O'Leary. It set out to answer a narrow, almost bureaucratic-sounding question: can an AI agent correctly retrieve viral sequence data from NCBI Virus, the database virologists rely on for outbreak surveillance, vaccine design, and diagnostic development?

To test it, the team built a benchmark called VirBench: 120 realistic queries spanning 40 different pathogens, each with a manually verified correct answer. These weren't trick questions. They looked like the kind of request a working virologist writes in an internal Slack message — "pull every Zaire ebolavirus sequence collected in Africa between January and June 2014, from human hosts, at least 15,200 bases long, excluding lab-passaged samples." The kind of query where the answer is either right or it isn't. There's no partial credit for a phylogenetic dataset that's missing a third of its records.
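One way to see why a request like this has exactly one right answer is to write it down as explicit filters. The sketch below is illustrative only: the `Record` fields, the `matches` helper, and the three sample records are invented for this example, and they are neither the NCBI Virus schema nor VirBench data.

```python
from dataclasses import dataclass
from datetime import date


@dataclass(frozen=True)
class Record:
    # Hypothetical fields chosen for this example; not the NCBI Virus schema.
    accession: str
    species: str
    continent: str
    host: str
    collected: date
    length: int
    lab_passaged: bool


def matches(r: Record) -> bool:
    """Apply every condition in the example request as a strict AND."""
    return (
        r.species == "Zaire ebolavirus"
        and r.continent == "Africa"
        and r.host == "Homo sapiens"
        and date(2014, 1, 1) <= r.collected <= date(2014, 6, 30)
        and r.length >= 15_200
        and not r.lab_passaged
    )


# Made-up records for illustration only.
records = [
    Record("EX0001", "Zaire ebolavirus", "Africa", "Homo sapiens",
           date(2014, 3, 17), 18_959, False),
    Record("EX0002", "Zaire ebolavirus", "Africa", "Homo sapiens",
           date(2014, 5, 2), 14_800, False),   # fails the length filter
    Record("EX0003", "Zaire ebolavirus", "Africa", "Homo sapiens",
           date(2014, 8, 9), 18_957, False),   # fails the date filter
]

selected = sorted(r.accession for r in records if matches(r))
print(selected)
```

Because each condition is a plain boolean test, two runs over the same records cannot disagree. That is the property the benchmark checks for.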

They ran the benchmark against six frontier systems: Claude Sonnet 4, Claude Opus 4.7, the open-source biology agent Biomni, Edison Analysis, and two GPT models. The results, without any additional tooling, were a mess. Mean accuracy across the field ranged from a low of 16.9 percent — that was Claude Sonnet 4 — up to 91.3 percent for GPT-5.5. But the more damning number wasn't the average. It was the variance. Asked the identical Ebola virus query three separate times, Sonnet 4 returned 106 sequences on the first attempt, 15 on the second, and 5 on the third. The correct answer, confirmed by manual retrieval, was 266.

Same model. Same prompt. Three different answers. That's not a reasoning failure in the way people usually mean it — the model wasn't confused about virology. It was navigating a database whose filtering logic lives only inside a web interface built for human clicking, not machine querying, and it silently gave up partway through retrieval in a different place each time.
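The scale of that gap is easy to work out from the numbers above. The short calculation below uses only the three reported counts and the verified total of 266. It treats returned divided by true as a ceiling on recall, which assumes nothing about how many of the returned sequences were actually correct.

```python
TRUE_COUNT = 266            # manually verified answer, as reported in the article
run_counts = [106, 15, 5]   # sequences returned by three identical runs

# Recall cannot exceed returned/true: at most n of the returned records are correct.
ceilings = [n / TRUE_COUNT for n in run_counts]

for i, (n, c) in enumerate(zip(run_counts, ceilings), start=1):
    print(f"run {i}: {n} of {TRUE_COUNT} sequences, recall at most {c:.1%}")

# How far apart the best and worst runs were.
spread = max(run_counts) / min(run_counts)
print(f"largest run returned {spread:.1f}x as many sequences as the smallest")
```

Even the best of the three runs could have recovered no more than about two fifths of the true dataset, and the worst recovered under 2 percent.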

Why a wrong number here is not a rounding error

It would be easy to shrug this off as an implementation detail — annoying, but not the kind of thing that changes anything downstream. The researchers didn't let that stand. They took the flawed datasets Sonnet 4 had assembled and ran them through a standard piece of outbreak science: building a phylogenetic tree to estimate when an epidemic actually began, a number epidemiologists call the time to most recent common ancestor, or TMRCA.

A tree built from the properly curated dataset placed the root of the 2014 West African Ebola outbreak in January 2014, matching published estimates almost exactly. A tree built from one of the AI-assembled datasets, missing sequences from Guinea, shifted that estimate to April 2014, changing the inferred timeline of the outbreak by three months. Another AI-run dataset was so incomplete it pushed the estimated origin back to 1922, off by nearly a century. The same pattern showed up when the team asked which existing antibody treatments would still bind an evolving Ebola glycoprotein: three different runs of the same query produced three different, mutually inconsistent pictures of which mutations mattered. What looked like a minor data-retrieval glitch had quietly become an outbreak-response problem.

Using AI agents to navigate biological data infrastructure is like driving through an old city that was designed before cars: the infrastructure may be beautiful and even thoughtful, but it's full of narrow, winding streets that are difficult for modern vehicles to navigate.

— Laura Luebbert, "Paving the way for agents in biology," Anthropic, June 2026

That's the reframe worth sitting with. We tend to imagine AI limitations as a ceiling on intelligence — the model doesn't understand enough biology, doesn't reason deeply enough, needs a bigger brain. Luebbert's argument is closer to the opposite: coding agents raced ahead over the past few years not because writing software is intellectually easier than biology, but because software infrastructure — version control, documented APIs, package managers — was already built with machines as users in mind. Biological data infrastructure wasn't. It was built for a scientist with a mouse, institutional memory, and the patience to reproduce a colleague's complicated filter settings by hand. An agent dropped into that environment isn't underpowered. It's a car built for a highway, idling at the mouth of a medieval alley.

The fix that had nothing to do with a smarter model

Working with engineers at the National Center for Biotechnology Information — NCBI, the U.S. government body that maintains GenBank and much of the world's public genomic data — Anthropic's team built a tool called gget virus. It sounds unglamorous, and it is: gget virus doesn't reason, doesn't generate hypotheses, doesn't do anything an outside observer would call "intelligent." It coordinates three separate NCBI systems (the REST, Datasets, and E-utilities APIs), decides which filters can be checked through those APIs and which have to be verified locally because the web interface hides logic that no single endpoint exposes, handles the pagination and batching needed to pull comprehensive results for viruses with huge record counts like influenza A and SARS-CoV-2, and then hands back a standardized, fully logged output — one that shows not just the answer, but exactly how the tool arrived at it.
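The general pattern described here (page through everything, re-check filters locally, log each step) can be sketched in a few lines. This is not gget virus's code or API. The `fetch_all` and `retrieve` functions, the `fake_fetch_page` stand-in, and the sample records are all hypothetical, and the sketch leaves out batching and the coordination of multiple NCBI services.

```python
from typing import Callable, Dict, Iterator, List, Optional, Tuple

Page = Tuple[List[dict], Optional[str]]  # (records, next-page token or None)


def fetch_all(fetch_page: Callable[[Optional[str]], Page]) -> Iterator[dict]:
    """Follow next-page tokens until the server reports no more pages."""
    token: Optional[str] = None
    while True:
        records, token = fetch_page(token)
        yield from records
        if token is None:
            return


def retrieve(
    fetch_page: Callable[[Optional[str]], Page],
    local_checks: Dict[str, Callable[[dict], bool]],
) -> Tuple[List[dict], List[str]]:
    """Return (kept records, log) so every step can be audited afterwards."""
    log: List[str] = []
    kept = list(fetch_all(fetch_page))
    log.append(f"fetched {len(kept)} records from server")
    for name, check in local_checks.items():
        before = len(kept)
        kept = [r for r in kept if check(r)]
        log.append(f"local filter '{name}': {before} -> {len(kept)}")
    return kept, log


# Stand-in for a paginated API: three pages of made-up records.
_PAGES: Dict[Optional[str], Page] = {
    None: ([{"id": "A", "length": 18900}, {"id": "B", "length": 900}], "p2"),
    "p2": ([{"id": "C", "length": 18950}], "p3"),
    "p3": ([{"id": "D", "length": 15100}], None),
}


def fake_fetch_page(token: Optional[str]) -> Page:
    return _PAGES[token]


kept, log = retrieve(
    fake_fetch_page,
    {"min_length_15200": lambda r: r["length"] >= 15200},
)
print([r["id"] for r in kept])
for line in log:
    print(line)
```

The design point is that the loop cannot stop early without raising an error, and the log records how many records each step saw. A silent partial retrieval, the failure mode described earlier, would show up as a mismatch in those counts.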

16.9% · Claude Sonnet 4 accuracy, no tool
92.8% · Claude Sonnet 4 accuracy, with gget virus
120 · VirBench queries across 40 pathogens
99.7% · Peak accuracy, GPT-5.5 with gget virus

Once every model in the test had access to gget virus, every single one crossed 90 percent accuracy. Claude Sonnet 4 — the model that had scored 16.9 percent on its own — jumped to 92.8 percent. Run-to-run variability, the more alarming problem in some ways than the raw accuracy number, largely disappeared. And the gap between a frontier model and a cheaper one narrowed to the point of near-irrelevance: with the right deterministic tool in place, which specific AI model you happened to be using stopped mattering very much at all.

The sentence that reframes the whole AI-for-science conversation

Buried in the middle of the research writeup is a line that deserves to be quoted more than the headline statistic, because it is the actual thesis: "Reliable dataset construction should not depend on access to the newest or most expensive model, or on knowing which model works best for a given database." Read that again next to the ordinary way AI progress gets narrated (new model, bigger benchmark score, another leap forward) and the contrast is stark. Anthropic, a company whose business is literally building bigger, more expensive models, published a study whose central finding is that the model wasn't the bottleneck. The data plumbing was. And once you fix the plumbing, a cheaper model paired with the right tool can outperform an expensive model working alone. In the benchmark, Claude Sonnet 4 with gget virus scored 92.8 percent, edging past the 91.3 percent that GPT-5.5, the best unaided performer, managed on its own.

Fig. 1 — Claude Sonnet 4's VirBench accuracy before and after gget virus, and how the tool coordinates three separate NCBI systems into one deterministic output.

The uncomfortable backdrop: no AI-discovered drug has ever reached the finish line

Anthropic didn't publish this research in a vacuum. On June 30, 2026, the company livestreamed a broader event called "The Briefing: AI for Science," unveiling a research-tuned version of Claude and gathering pharmaceutical and academic partners to make the case that AI is now doing real scientific work, not just assisting with paperwork around it. The event landed at an awkward moment for that pitch. As of that date, no drug wholly discovered by artificial intelligence had won FDA approval. Not one, after nearly a decade of AI-drug-discovery startups raising billions of dollars on the promise that machine learning would compress a normally ten-year, multi-billion-dollar drug development process into a fraction of the time.

That's not for lack of trying. More than 200 AI-discovered drug candidates are currently in clinical trials, roughly fifteen of them in Phase III — the last and most expensive hurdle before a regulatory filing. Insilico Medicine's rentosertib, a treatment for a fatal lung-scarring disease, cleared an early trial with results published in a major medical journal and is now approaching Phase III. Relay Therapeutics' zovegalisib has advanced into Phase III with a regulatory fast-track designation. But the industry has also absorbed real disappointments — Recursion Pharmaceuticals discontinued its lead AI-discovered candidate after longer-term data failed to confirm earlier promising results. 2026 is shaping up to be the year several of these Phase III readouts finally arrive, which means it's also the year the industry either earns its first genuine AI-drug approval or has to reckon publicly with the gap between a decade of hype and an empty scoreboard.

If we want agents to help with scientific discovery, from outbreak response to drug design to biological modeling, we need to build biological data infrastructure that they can navigate as reliably as humans do.

— Laura Luebbert, Anthropic research, June 2026

Put those two stories side by side and something clicks into place. The AI-drug-discovery industry has spent years selling a narrative about model sophistication — bigger neural networks, better generative chemistry, smarter target identification. The VirBench findings suggest that narrative may have been aimed at the wrong layer of the stack entirely. If a frontier model can swing from 16.9 percent to 92.8 percent accuracy on a well-defined retrieval task purely by being handed a reliable data pipe — with zero change to its underlying intelligence — then it's worth asking how much of the "AI can't quite deliver a real drug yet" story is actually a story about messy, non-reproducible scientific databases rather than insufficiently clever algorithms.

An old problem with a very new label

None of this is really about artificial intelligence, if you squint. Computational biologists have complained for decades about the state of public genomic databases: inconsistent metadata, incompatible identifier systems, sequences duplicated across GenBank and RefSeq with no easy way to tell which is authoritative, filtering logic that exists only as institutional folklore passed between lab members. Tools like Biopython, Entrez Direct, and the original gget package were all attempts, long before large language models existed, to drag that mess into something a computer could reliably parse. What's new is the audience. When the only users of a clunky database were patient, expert humans who already knew its quirks, the friction was an annoyance. When millions of AI agents start hammering the same portals, expecting machine-grade reliability, the same friction becomes a hard failure mode — the kind that can quietly reshape a phylogenetic timeline or misjudge whether a stockpiled antibody will still neutralize a mutating virus.

Anthropic's own framing acknowledges this won't stay true forever. As models keep improving, the argument goes, agents may eventually get good enough to navigate a messy web portal on their own, recovering gracefully from the pagination failures and inconsistent filters that tripped up Sonnet 4. But even in that future, the researchers argue, cheap, auditable, deterministic tools will likely still matter. A model that can muscle its way through a confusing bioinformatics workflow should not have to do so every single time, at real cost and with a real risk that a plausible-looking wrong answer slips through unchecked.

What this means if you don't work in a lab

You don't need to care about NCBI Virus's API architecture to feel the consequences of this story. Every prediction about AI transforming medicine, agriculture, or pandemic response quietly assumes that AI agents can reliably pull accurate, up-to-date facts from the world's scientific record. This research is a rare, concrete measurement of how far that assumption currently is from true — and how cheaply, in at least one important case, the gap could be closed. The uncomfortable part is how mundane the fix turned out to be. Not a new foundation model. Not a research breakthrough in machine reasoning. A carefully engineered piece of plumbing, built jointly with the government database maintainers who understood the quirks better than any outside team could reverse-engineer alone.

That's a less thrilling story than "AI discovers cure," and it may be the more important one. The next wave of genuinely useful AI-for-science claims will likely be judged less by which model powers them and more by whether anyone bothered to fix the unglamorous data infrastructure underneath — the equivalent of paving the road before boasting about the car. Expect more collaborations between AI labs and the institutions that actually maintain scientific data: NCBI, the European Bioinformatics Institute, structural databases, clinical trial registries. Expect the definition of "AI-ready" science to increasingly mean deterministic, well-logged, reproducible data access, not just a more capable model sitting on top of the same crumbling foundation. And expect 2026's Phase III drug trial results, whichever way they land, to be read by a slightly more skeptical, slightly better-informed audience — one that now has a name for the gap between promise and plumbing.

Sources

Luebbert, L. "Paving the way for agents in biology." Anthropic Research, June 8, 2026. anthropic.com/research/agents-in-biology
Nasri, F., Gurev, S., Varilly, P., Ramesh, K., O'Leary, N. A., Cool, J., Renard, B. Y., Sabeti, P., Luebbert, L. "Deterministic access to global viral sequence data enables robust agentic scientific discovery." arXiv:2606.06749, 2026. arxiv.org/pdf/2606.06749
arXiv HTML version. arxiv.org/html/2606.06749v1
Anthropic. "The Briefing: AI for Science" virtual event page, June 30, 2026. anthropic.com/events/the-briefing-ai-for-science-virtual-event
Anthropic. "The Second Opinion" and Anthropic Science hub. anthropic.com/science
"AI Agents in Biology Are Too Inaccurate to Use: Anthropic's Deterministic Tool Is the Fix." Tech Times, June 9, 2026. techtimes.com
"Anthropic Livestreams AI for Science Pitch as No AI Drug Has Won FDA Approval." Tech Times, June 30, 2026. techtimes.com
"Anthropic's AI for Biology: The Accuracy Crisis Explained." byteiota, 2026. byteiota.com
"Anthropic Research Reveals: AI Biology Agents Are Stalling — The Bottleneck Isn't Models, It's Data Infrastructure." BigGo Finance, 2026. finance.biggo.com
"Anthropic VirBench: Why Biological Agents Need Deterministic Tools Like gget virus." explainx.ai Blog, 2026. explainx.ai
"Anthropic says AI can run science experiments now rather than just plan them." R&D World, June 2026. rdworldonline.com
"5 takeaways from Anthropic's big science event." Fast Company, June 2026. fastcompany.com
"AI-Discovered Drugs Reach Phase III. And 2026 Will Determine Whether All the Promises Were Real." humai.blog, 2026. humai.blog
"AI Drug Discovery 2026: 173 Programs, FDA Framework & Market." Axis Intelligence, 2026. axis-intelligence.com
World Health Organization. "Experts convened by WHO advise on candidate treatments and vaccines for Ebola disease caused by Bundibugyo virus." May 28, 2026. who.int
