A hematology decision-support agent was tested the hard way: not on old charts, but on real cases that hadn't been decided yet, at hospitals whose records it had never seen. That combination is rarer in clinical AI than it has any right to be.
Picture the room: a windowless conference space on the seventh floor of a university hospital, a long table, a wall monitor throwing pale light across a dozen tired faces. A hematologist reads out a case — 61-year-old woman, relapsed acute myeloid leukemia, a genomic panel bristling with mutations that don't play nicely together. Around the table: an oncologist, a pathologist, a genomicist, a pharmacist, a data manager. And, projected quietly on a second screen that nobody quite looks at directly, a recommendation that was generated hours earlier by software — before anyone in the room had said a word. By the time the discussion ends, the room's decision and the machine's suggestion match. This has now happened often enough, in enough hospitals, on enough real and not-yet-decided cases, that a paper describing it was published in Nature Microbiology on June 30, 2026 — and it is one of the first clinical-AI studies in oncology to have earned the right to make that claim without an asterisk.
A tumor board — sometimes called a multidisciplinary team meeting, sometimes a molecular tumor board when genomics are central — is the closest thing modern oncology has to a jury. Specialists from different disciplines convene, usually weekly, to review complicated cancer cases and agree on a plan: which drug, which trial, which sequence of therapies, whether to biopsy again, whether to stop treating altogether. It is slow, expensive, and utterly dependent on the specific mix of expertise sitting in the room that day.
Hematologic malignancies make this harder than almost any other cancer type. Leukemias, lymphomas, and myelomas are not lumps you can biopsy once and file away. They are moving targets: clonal populations that evolve under treatment pressure, respond to therapy in weeks rather than months, and get re-classified by the World Health Organization every few years as the underlying molecular biology comes into sharper focus. A single patient's case file might include serial bone marrow biopsies, flow cytometry, cytogenetics, a rotating cast of targeted-panel sequencing results, and a treatment history that branches every time the disease relapses. Asking a language model to reason over that is not like asking it to summarize a radiology report. It is like asking it to follow a novel with an unreliable narrator, written in five specialist dialects, where the plot keeps changing based on decisions made in earlier chapters.
That is precisely the setting in which the new study, led by researchers including Mirco J. Friedrich, chose to test a locally deployable AI agent: one designed to run inside a hospital's own infrastructure rather than be piped through an external API, and grounded in the actual case record rather than general medical knowledge alone.
Here is the detail that separates this paper from the flood of "AI matches doctors" headlines that have become almost a genre unto themselves: the agent wasn't just checked against old, already-decided cases at the hospital where it was built. It was validated three ways — retrospectively, externally, and prospectively — and it is the second and third of those that matter most.
Retrospective validation means testing a model against historical cases whose outcomes are already known. It's useful, but it's also the easiest bar to clear, because the model, the researchers, and sometimes the training data itself all exist downstream of that history. External validation means testing the same model on cases from a different institution — one whose patient population, documentation habits, lab equipment, and clinical culture the model has never seen. Prospective validation is the rarest and hardest: testing the model on cases that have not yet been decided, in real time, so that its recommendation is locked in before the human tumor board deliberates and reaches its own conclusion. There is no way to retrofit a prospective study after the fact. The model has to commit to an answer before anyone in the room knows what the "right" answer will turn out to be.
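The "commit before anyone knows the answer" rule can be made concrete. What follows is a minimal sketch of one way a team might record and later audit that lock-in, not the study's actual protocol: the names `LockedRecommendation`, `lock_recommendation`, and `is_prospective` are hypothetical, and the case data is invented for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
import hashlib


@dataclass(frozen=True)
class LockedRecommendation:
    """Hypothetical record of an agent recommendation, frozen at lock time."""
    case_id: str
    recommendation: str
    locked_at: datetime
    digest: str  # SHA-256 over case id, recommendation text, and timestamp


def _digest(case_id: str, recommendation: str, locked_at: datetime) -> str:
    payload = f"{case_id}|{recommendation}|{locked_at.isoformat()}"
    return hashlib.sha256(payload.encode("utf-8")).hexdigest()


def lock_recommendation(case_id: str, recommendation: str,
                        now: datetime) -> LockedRecommendation:
    """Freeze the agent's answer together with the moment it was given."""
    return LockedRecommendation(case_id, recommendation, now,
                                _digest(case_id, recommendation, now))


def is_prospective(locked: LockedRecommendation,
                   board_decided_at: datetime) -> bool:
    """True only if the record is unaltered and predates the board's decision."""
    unaltered = locked.digest == _digest(
        locked.case_id, locked.recommendation, locked.locked_at)
    return unaltered and locked.locked_at < board_decided_at


# Illustrative (invented) case: locked at 08:00 UTC, board decides at 14:00 UTC.
locked = lock_recommendation(
    "case-001", "hypothetical regimen A",
    datetime(2026, 1, 5, 8, 0, tzinfo=timezone.utc))
print(is_prospective(locked, datetime(2026, 1, 5, 14, 0, tzinfo=timezone.utc)))  # True
print(is_prospective(locked, datetime(2026, 1, 5, 7, 0, tzinfo=timezone.utc)))   # False
```

The digest is the point of the design: a recommendation edited after the board met would no longer match its own hash, which is the property that makes a prospective comparison impossible to retrofit.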
Running all three checks on the same system, across multiple hospitals, is uncommon enough in clinical AI that it's worth pausing on. Most published studies — including ones that make confident claims about matching physician judgment — never leave the retrospective stage. This one did, and it did so specifically in hematology, a subfield where the case complexity gives a model every opportunity to fail quietly.
Regular readers may recall The Validation Gap, which covered a separate Nature Medicine benchmark from earlier this June showing general-purpose chatbots outperforming FDA-cleared clinical AI tools on certain tasks — and the uncomfortable regulatory question that raised about what "cleared" even certifies. That story was about the gap between regulatory approval and real-world competence. This one is different, though it rhymes: it's about the gap between what most clinical-AI research calls "validated" and what validation would need to mean to trust a system with a real, undecided case. The hematology tumor board study doesn't resolve the FDA-clearance question. But it does offer something the field has been short on — a working example of what clearing the higher bar actually looks like.
Three figures make a useful comparison point. They come from a separate, earlier Nature Cancer study (2025), not from the hematology paper, and are worth holding up as a comparison rather than a duplicate. Researchers built an autonomous AI agent, powered by GPT-4 plus a toolkit of vision transformers and image-segmentation models, and tested it on 20 realistic multimodal oncology cases. The agent used the correct clinical tool at the right moment 87.5% of the time, reached the correct overall clinical conclusion in 91.0% of cases, and correctly cited the relevant oncology guideline 75.5% of the time. For context, GPT-4 alone, without the tool-using agent scaffold, scored roughly 30% on comparable decision accuracy; wrapped in an agentic framework that could call the right specialist tools, it scored around 87%. That gap, between a raw model and the same model wired into a workflow, is a big part of why "agentic" has become the operative word in this field: the intelligence isn't only in the underlying model, it's in the scaffolding that tells it when to look something up, when to defer, and when to check its own work.
None of this should read as an all-clear for AI oncology. The most sobering counterweight arrived via MTBBench, a 2025–2026 benchmark built specifically to simulate molecular-tumor-board-style reasoning across genuinely difficult cases — the kind involving imaging, lab values, pathology slides, genomic panels, and free text, all unfolding across multiple points in time, sometimes with data that contradicts itself the way real medical records do. MTBBench's dataset spans 66 patient cases and 573 expert-validated question-answer pairs, and its authors were blunt about what they found: even large, capable models are, in their words, unreliable. They hallucinate frequently. They struggle to reason across time-resolved data. They fumble when evidence from different modalities doesn't neatly agree. In some of the benchmark's toughest tasks, baseline agents fabricated file names that didn't exist, failed to retrieve genuinely new critical documents, or kept reusing stale context from earlier in a case — a failure mode that shows up most often in exactly the tasks a hematology tumor board cares about most: predicting outcomes and catching recurrence.
There is a silver lining buried in that same paper. When MTBBench's authors wrapped baseline models in an agentic framework equipped with foundation-model-based tools — giving the system structured ways to query images, retrieve records, and cross-check claims rather than relying on free-form generation — performance improved substantially: up to a 9.0% gain on multimodal reasoning tasks and up to 11.2% on longitudinal, time-resolved reasoning. That's the same underlying lesson as the Nature Cancer figures above: bare language models are not the story. Language models embedded in carefully engineered agent architectures, with tools, retrieval, and guardrails, are a meaningfully different and more capable story — though evidently still an incomplete one.
Two other Nature-family studies published this year make a related point from a different angle: agentic AI systems can meaningfully assist decision-making across the diagnosis, treatment, and hospital-admission stages of patient care. But in both cases, the researchers stopped short of calling either model ready for real-world clinical deployment. Promising performance on a benchmark and readiness to be trusted with an actual patient are not the same claim, and the field's more careful voices have been increasingly insistent about not conflating them.
"Current LLMs, even at scale, lack the reliability required for real-world oncology decision support, frequently hallucinating and struggling to reason over time-resolved, multimodal, and sometimes conflicting clinical evidence."
Paraphrased finding, MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology (arXiv 2511.20490)
So which picture is right: the agent that matches the room, or the benchmark that calls these models unreliable? Both, in the sense that matters. The Nature Microbiology hematology study and MTBBench are not contradicting each other; they're describing different points on the same map. MTBBench is a stress test built to expose failure across the messiest, most adversarial version of the problem: long time horizons, conflicting modalities, cases explicitly chosen to be hard. The hematology tumor board agent, by contrast, was built and evaluated for a narrower, more disciplined task: producing a case-grounded recommendation that a real hematology team, working from the same chart, would independently reach. It is not a general oncology reasoning engine. It is a specific tool, tested against a specific, well-defined form of agreement, at more than one hospital, on cases that hadn't been decided yet.
That narrowness is not a weakness — it may be the actual lesson. The systems that are approaching genuine clinical readiness in 2026 are not the ones promising to reason about everything. They're the ones with a tightly scoped job description, a validation protocol that includes prospective and external testing as table stakes rather than a stretch goal, and a resulting concordance rate that a hospital's own credentialing committee could actually evaluate on its own terms.
The published reaction from clinicians to systems like this has been notably more measured than the press coverage. The consistent theme across commentary on both the Nature Cancer 2025 agent and this year's hematology work is that these tools are being framed as decision support, not decision replacement: a second opinion that arrives before the human second opinion, not instead of it. Clinicians interviewed about comparable studies have repeatedly emphasized that a concordance rate, however high, describes agreement with what a board already decided or was about to decide; it does not, by itself, establish that the AI's underlying reasoning was sound, or that it would have caught the one-in-twenty case where the tumor board itself got it wrong. Encouragingly, this is also the argument researchers make for why prospective, external testing matters so much: it is the only design that even gives a wrong human decision the chance to diverge from a right AI one, and vice versa, in a way retrospective benchmarking structurally cannot.
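It helps to see how little a bare concordance rate says on its own. The sketch below is illustrative only: the function names and the 20-case example are assumptions made for demonstration, not figures or methods from either study. It computes the share of cases where agent and board agree, then a Wilson score interval (a standard textbook interval for a proportion) to show how much uncertainty a small sample leaves.

```python
import math


def concordance_rate(agent: list[str], board: list[str]) -> float:
    """Fraction of cases where the agent's recommendation equals the board's."""
    if not agent or len(agent) != len(board):
        raise ValueError("need two non-empty, equal-length lists")
    matches = sum(a == b for a, b in zip(agent, board))
    return matches / len(agent)


def wilson_interval(successes: int, n: int, z: float = 1.96) -> tuple[float, float]:
    """Wilson score interval for a proportion (z = 1.96 gives roughly 95%)."""
    if n <= 0 or not 0 <= successes <= n:
        raise ValueError("need n > 0 and 0 <= successes <= n")
    p = successes / n
    denom = 1 + z ** 2 / n
    centre = (p + z ** 2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z ** 2 / (4 * n ** 2)) / denom
    return centre - half, centre + half


# Invented example: the agent agrees with the board on 18 of 20 cases.
agent = ["A"] * 18 + ["B", "C"]
board = ["A"] * 20
rate = concordance_rate(agent, board)
low, high = wilson_interval(18, 20)
print(f"concordance {rate:.2f}, interval {low:.2f} to {high:.2f}")
```

On those invented numbers, a headline figure of 90% is compatible with a true agreement rate anywhere from roughly 70% to 97%. And none of it addresses the clinicians' deeper point: agreement with the board is not the same as being right.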
Even a well-validated narrow success leaves real questions unresolved. Liability sits at the top of the list: if a hospital deploys a locally run agent that participates in a hematology tumor board's deliberation and a patient is harmed, does responsibility sit with the treating physician, the hospital's IT and compliance apparatus, or the model's developers — and does the answer change depending on whether the human team deviated from or followed the AI's suggestion? Over-reliance is the second concern, the one clinicians raise most often in private: a highly concordant tool, used for long enough, risks becoming a tool nobody double-checks, which is precisely the condition under which its failure modes — the same hallucination and reasoning gaps MTBBench documented — become dangerous rather than merely embarrassing. And edge cases remain the hardest problem of all, almost by definition: rare hematologic subtypes, patients with unusual comorbidities, genomic findings that don't fit any existing guideline cleanly, are exactly the cases where concordance with a historical tumor board pattern is least informative, because there may be no clear pattern to match.
If this study represents anything larger than one well-run trial, it's a template. The next wave of clinical-AI papers worth taking seriously will likely be judged less by their headline accuracy number and more by which of these three validation boxes they can honestly check: retrospective, external, prospective. Expect more hospitals to start running locally deployable agents in shadow mode — quietly generating recommendations that clinicians don't see until after they've made their own call — specifically so that prospective, blinded comparisons become routine rather than remarkable. Expect regulators, still digesting the questions raised by the FDA-clearance gap covered elsewhere on this site, to start asking not just "does it work" but "how was it tested, and by whom, and on whose patients." And expect the skeptics behind benchmarks like MTBBench to keep building harder tests, because a field that only measures its successes against yesterday's easy cases will eventually be surprised by tomorrow's hard one. The real second opinion here may not be the AI's. It's the one the research community keeps giving itself, by refusing to grade its own homework.
© 2026 Lisa Pedrosa · lisapedrosa.com