When Galileo pointed a telescope at Jupiter in January 1610 and saw four small points of light that moved relative to the planet night after night, he wasn't just observing more clearly. He had a new instrument, one that made possible a whole category of discovery that had been structurally impossible before. On May 19, 2026, two papers published back-to-back in Nature suggested that something comparably significant may be happening to science right now. Two research teams, working independently, arrived at the same conclusion in the same week: AI can now function as a genuine scientific co-investigator.
The papers that landed the same week
The two systems are architecturally similar but independently built. Co-Scientist, from Google DeepMind, is built on Gemini 2.0. Robin, from FutureHouse, a nonprofit with the explicit mission of automating scientific discovery, runs on a combination of Anthropic's Claude 3.7 and OpenAI's o4-mini. Both are multi-agent systems: collections of specialised AI agents coordinated by a supervising layer to execute different steps of the research process.
Both teams published not just descriptions of their systems, but experimental results. Not thought experiments or benchmark scores — real laboratory validation of AI-generated hypotheses.
Novel combination therapies for acute myeloid leukaemia
AI-proposed drug repurposing candidates confirmed to inhibit tumour viability in multiple AML cell lines at clinically relevant concentrations.
Vorinostat as an anti-fibrotic agent
Co-Scientist identified the FDA-approved anti-cancer drug Vorinostat as a liver fibrosis candidate. In hepatic organoid tests, the drug reduced TGFβ-induced chromatin structural change by 91%.
Ripasudil as a novel treatment for dry macular degeneration
Robin identified ripasudil, a glaucoma treatment already used in ophthalmology, as a novel candidate for dry age-related macular degeneration (dAMD), a condition it had never been proposed for. The candidate was lab-validated in RPE cell culture.
Mechanisms of antimicrobial resistance gene transfer
Co-Scientist generated novel hypotheses about the evolutionary mechanisms by which resistance genes spread between bacteria — relevant to the global antimicrobial resistance crisis.
None of these are clinical results. None of the drugs have entered human trials. The researchers from both teams are careful to note that preclinical validation is necessary before any therapeutic claim can be made. What is validated is not the drug — it is the process: that AI can generate scientifically useful hypotheses, propose and guide experiments to test them, and identify candidates that had not occurred to human researchers working in the field.
The architecture of a machine scientist
Both systems share a fundamental design philosophy: the discovery process is decomposed into specialised steps, each handled by a different agent, coordinated by a supervisor. The decomposition mirrors the actual structure of scientific research — literature review, hypothesis formation, experimental design, data analysis, interpretation, refinement — and assigns each step to a component optimised for it.
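The decomposition described above can be sketched in a few lines of Python. This is a minimal illustration, not either team's implementation: the stage names are taken from the research steps listed in the paragraph, while `ResearchState`, `Supervisor`, and `make_stub_agent` are hypothetical names, and each stub agent simply returns a placeholder string where a real system would call a language model.

```python
from dataclasses import dataclass, field
from typing import Callable

# Stage names follow the research steps listed in the text.
STAGES = [
    "literature_review",
    "hypothesis_formation",
    "experimental_design",
    "data_analysis",
    "interpretation",
    "refinement",
]


@dataclass
class ResearchState:
    """Shared record that each agent reads from and the supervisor updates."""
    question: str
    notes: dict[str, str] = field(default_factory=dict)


# An agent is any callable that reads the shared state and returns its output.
Agent = Callable[[ResearchState], str]


def make_stub_agent(stage: str) -> Agent:
    """Hypothetical placeholder; a real agent would query a language model."""
    def agent(state: ResearchState) -> str:
        return f"[{stage}] output for: {state.question}"
    return agent


class Supervisor:
    """Hypothetical coordinator that runs one specialised agent per stage."""

    def __init__(self, agents: dict[str, Agent]):
        self.agents = agents

    def run(self, question: str) -> ResearchState:
        state = ResearchState(question)
        for stage in STAGES:
            # Each agent sees everything produced by the earlier stages.
            state.notes[stage] = self.agents[stage](state)
        return state


supervisor = Supervisor({stage: make_stub_agent(stage) for stage in STAGES})
result = supervisor.run("Which approved drugs might slow liver fibrosis?")
```

The point of the sketch is the shape, not the content: one shared state, one agent per step, and a supervisor that decides the order. Real systems would presumably loop, branch, and revisit earlier stages rather than run a single linear pass.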
Co-Scientist. General purpose across disciplines. Uses a "hypothesis tournament": an internal review board that tests each proposed hypothesis against the existing scientific literature before it advances. This addresses hallucination: a proposal that contradicts established evidence is challenged internally before it becomes an experimental directive.
Designed by Vivek Natarajan et al. at Google DeepMind. Built on Gemini 2.0. Validated across biomedicine.
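One way to picture a hypothesis tournament is as a review gate followed by pairwise comparisons. The sketch below is a toy under stated assumptions, not Co-Scientist's actual mechanism: the `supporting` and `contradicting` paper counts, the `passes_review` gate, and the `judge` rule are all hypothetical stand-ins for what would in practice be model-driven literature review.

```python
from dataclasses import dataclass
from itertools import combinations


@dataclass(frozen=True)
class Hypothesis:
    text: str
    supporting: int      # hypothetical count of papers consistent with it
    contradicting: int   # hypothetical count of papers that conflict with it


def passes_review(h: Hypothesis) -> bool:
    """Toy review gate: reject proposals the evidence mostly contradicts."""
    return h.supporting > h.contradicting


def judge(a: Hypothesis, b: Hypothesis) -> Hypothesis:
    """Toy pairwise judge: prefer the larger net evidence margin."""
    margin_a = a.supporting - a.contradicting
    margin_b = b.supporting - b.contradicting
    return a if margin_a >= margin_b else b


def run_tournament(candidates: list[Hypothesis]) -> list[Hypothesis]:
    """Filter by review, then rank survivors by round-robin wins."""
    survivors = [h for h in candidates if passes_review(h)]
    wins = {h: 0 for h in survivors}
    for a, b in combinations(survivors, 2):
        wins[judge(a, b)] += 1
    return sorted(survivors, key=lambda h: wins[h], reverse=True)


# Made-up candidates, for illustration only.
candidates = [
    Hypothesis("Drug A reduces fibrosis markers", supporting=12, contradicting=2),
    Hypothesis("Drug B reduces fibrosis markers", supporting=3, contradicting=9),
    Hypothesis("Drug C reduces fibrosis markers", supporting=7, contradicting=1),
]
ranking = run_tournament(candidates)
```

In this toy run, Drug B is challenged at the review gate because the evidence against it outweighs the evidence for it, and the two survivors are ranked by their head-to-head result.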
Robin. Drug discovery specialist. Three core agents: Crow (literature search), Falcon (candidate evaluation), and Finch (data analysis; it writes and executes its own Python and R code). A "built-in brake" restricts Robin to established knowledge and limits irrational leaps in logic, its version of hallucination mitigation.
Designed by Sam Rodriques, Michaela Hinks et al. at FutureHouse. Built on Claude 3.7 and o4-mini. Open-sourced.
The human researchers in both cases executed the physical laboratory experiments: culturing cells, running assays, preparing organoids. But the intellectual scaffolding (what to test, why, what to do with the results, what to try next) came from the AI. All hypotheses, experimental choices, data analyses, and figures in Robin's dAMD paper were generated autonomously. The paper was written by AI. Human researchers validated it.
"These systems are designed to collaborate with researchers, and a scientist would always be in the loop. The real-world demonstrations from both groups provide examples of what the future of scientific research with AI agents might look like." (Nature Press Release, May 19, 2026)
What the same papers reveal
The more honest the science, the more it reports what failed alongside what succeeded. Both Nature papers document limitations clearly — and the same limitations have been documented in independent analysis of the results.
The most fundamental is structural: both systems communicate through natural language. Language is the medium that makes AI scientists accessible to human researchers. It is also a medium with inherent imprecision. Scientific communication requires exact quantities, precise units, and unambiguous descriptions of experimental conditions. Natural language approximates all of these. "Increase phagocytosis significantly" is a hypothesis. "Increase phagocytosis by ≥40% at 10 μM concentration in ARPE-19 cells within 24 hours of treatment" is a scientific claim. The distance between those two formulations is the distance between promising and reproducible.
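The difference between the two formulations becomes concrete once the second one is written as data. The following sketch encodes the quantitative claim from the paragraph above as a small Python record that can be checked against a measurement. The class name, field names, and the baseline and treated readings are hypothetical and chosen only for illustration; nothing here comes from either paper's code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class QuantitativeClaim:
    """A testable claim: every field is a number with a unit or a named condition."""
    readout: str             # what is measured
    min_change_pct: float    # minimum relative increase, in percent
    concentration_um: float  # drug concentration, in micromolar
    cell_line: str
    window_hours: float      # time after treatment, in hours

    def is_supported(self, baseline: float, treated: float) -> bool:
        """Check a pair of measurements against the claimed threshold."""
        if baseline <= 0:
            raise ValueError("baseline must be positive")
        change_pct = 100.0 * (treated - baseline) / baseline
        return change_pct >= self.min_change_pct


# The claim from the text: >=40% at 10 uM in ARPE-19 cells within 24 hours.
claim = QuantitativeClaim(
    readout="phagocytosis",
    min_change_pct=40.0,
    concentration_um=10.0,
    cell_line="ARPE-19",
    window_hours=24.0,
)

# Made-up readings in arbitrary units, for illustration only.
strong_result = claim.is_supported(baseline=100.0, treated=145.0)
weak_result = claim.is_supported(baseline=100.0, treated=130.0)
```

"Increase phagocytosis significantly" has no equivalent record: there is no threshold to compare against, so no measurement can confirm or refute it.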
Language-based AI systems face a structural limit in biology: the quantitative complexity of biological systems — dose-response curves, off-target effects, cell-line-specific behaviour, time dependence — cannot be fully encoded in natural language prompts and responses. A model that reasons primarily through text cannot natively represent a three-dimensional protein binding site, a pharmacokinetic equation, or the statistical distribution of results across a cell population.
Both teams took care to limit hallucination (confident, false assertions). Co-Scientist uses an internal review board; Robin uses a literature-grounding brake. These are meaningful mitigations. They do not resolve the deeper issue: natural language is not biology's native medium.
A second limitation is domain breadth. Robin, applied to a well-studied disease with a rich literature (dAMD), performed remarkably. It is less clear how the same system would operate in a less-characterised domain — a disease with sparse literature, contradictory findings, or research primarily published in non-English languages. The systems are powerful precisely because they have enormous amounts of literature to synthesise. In domains where that literature is thin, their advantage narrows.
A third limitation, acknowledged in the Nature editorial accompanying both papers, is the boundary between hypothesis generation and experimental verification. AI systems currently cannot perform wet lab experiments. They cannot observe that an assay result looks anomalous, cannot smell that a reagent has degraded, cannot notice that a cell culture behaved differently than expected in a way not captured by the data file. The human researcher's embodied presence in the laboratory is not yet replaceable, and may not be for a long time.
A new instrument, not a replacement
Researchers at Stanford HAI's AI+Science conference in May 2026 reached for the same historical analogy independently: the telescope, the microscope. What those instruments had in common was not that they replaced scientists, but that they made a category of discovery possible that had been structurally impossible before. Stars were always there. Cells were always there. The knowledge was latent, waiting for an instrument sensitive enough to reveal it.
What is latent in the scientific literature today — in the 50 million biomedical papers that no single human researcher can read, in the cross-domain connections that disciplinary specialisation makes invisible, in the pattern that emerges only when you can hold the entire literature in working memory at once — is unknown. The argument both Nature papers make, implicitly, is that some of it is accessible now. Robin identified ripasudil for dAMD not because it reasoned better than an ophthalmologist, but because it read more widely, without the disciplinary borders that constrain expert thinking.
The rate at which scientific knowledge is produced has been accelerating for decades. The rate at which any single researcher can absorb and synthesise that knowledge has not. AI scientists, at their best, do not outthink human researchers. They out-read them.
Whether this constitutes a telescope moment — whether future scientists will look back at May 2026 as the month the instrument arrived — depends on how these systems perform as they scale, diversify, and encounter harder problems. The results so far are a proof of concept, not a proof of general capability. But the structure of both papers — independent teams, different architectures, same week, same conclusion, peer-reviewed in Nature — is the structure of a real signal.
The question worth sitting with is this: if AI can already traverse 50 million papers and identify a drug nobody thought to try — what else is in there?