What is a tumor board and how does AI fit into it?

A tumor board is a multidisciplinary meeting where specialists in oncology, pathology, genomics, and pharmacy convene to agree on a treatment plan for complex cancer cases. The June 2026 Nature Microbiology study tested an AI agent that generated hematology recommendations before the human tumor board met, then compared the two conclusions to measure concordance.

What is the difference between retrospective and prospective AI validation in clinical studies?

Retrospective validation tests a model against historical, already-decided cases, which is the easiest bar to clear. Prospective validation requires locking in the AI's recommendation before the human team deliberates on a real, undecided case. That makes it impossible to retrofit results after the fact, and it makes the result far more meaningful as a test of real-world reliability.

How accurate is AI at making oncology decisions compared to doctors?

In the 2025 Nature Cancer study, an autonomous AI agent reached the correct clinical conclusion in 91.0 percent of 20 multimodal oncology cases and used the correct clinical tool 87.5 percent of the time. However, the MTBBench benchmark found that even large models hallucinate frequently and struggle with time-resolved, conflicting clinical data, which suggests current AI is not yet reliable enough for real-world deployment.

What is MTBBench and what did it find about AI in oncology?

MTBBench is a 2025-2026 benchmark designed to simulate molecular tumor board reasoning across 66 patient cases and 573 expert-validated question-answer pairs involving imaging, lab values, pathology, genomics, and free text. Its authors found that even capable AI models hallucinate, fail to reason across time-resolved data, and sometimes fabricate file names or reuse stale context, concluding that current LLMs lack the reliability required for real-world oncology decision support.

Why do most clinical AI studies not test prospectively or externally?

Prospective testing requires the AI to commit to an answer before the human team decides, and external testing requires a separate hospital willing to run an unproven system on live cases, both of which are logistically demanding and uncommon. Most published studies stop at retrospective testing against static, already-adjudicated datasets at the same institution that built the model, which the article describes as a grade-your-own-homework problem.

The Second Opinion: When an AI Agent Sat on the Tumor Board

A hematology decision-support agent was tested the hard way — not on old charts, but on real cases that hadn't been decided yet, at hospitals that had never trained it. That combination is rarer in clinical AI than it has any right to be.

July 2, 2026 · Lisa Pedrosa

Picture the room: a windowless conference space on the seventh floor of a university hospital, a long table, a wall monitor throwing pale light across a dozen tired faces. A hematologist reads out a case — 61-year-old woman, relapsed acute myeloid leukemia, a genomic panel bristling with mutations that don't play nicely together. Around the table: an oncologist, a pathologist, a genomicist, a pharmacist, a data manager. And, projected quietly on a second screen that nobody quite looks at directly, a recommendation that was generated hours earlier by software — before anyone in the room had said a word. By the time the discussion ends, the room's decision and the machine's suggestion match. This has now happened often enough, in enough hospitals, on enough real and not-yet-decided cases, that a paper describing it was published in Nature Microbiology on June 30, 2026 — and it is one of the first clinical-AI studies in oncology to have earned the right to make that claim without an asterisk.

What a tumor board actually is, and why hematology makes it brutal

A tumor board — sometimes called a multidisciplinary team meeting, sometimes a molecular tumor board when genomics are central — is the closest thing modern oncology has to a jury. Specialists from different disciplines convene, usually weekly, to review complicated cancer cases and agree on a plan: which drug, which trial, which sequence of therapies, whether to biopsy again, whether to stop treating altogether. It is slow, expensive, and utterly dependent on the specific mix of expertise sitting in the room that day.

Hematologic malignancies make this harder than almost any other cancer type. Leukemias, lymphomas, and myelomas are not lumps you can biopsy once and file away. They are moving targets: clonal populations that evolve under treatment pressure, respond to therapy in weeks rather than months, and get re-classified by the World Health Organization every few years as the underlying molecular biology comes into sharper focus. A single patient's case file might include serial bone marrow biopsies, flow cytometry, cytogenetics, a rotating cast of targeted-panel sequencing results, and a treatment history that branches every time the disease relapses. Asking a language model to reason over that is not like asking it to summarize a radiology report. It is like asking it to follow a novel with an unreliable narrator, written in five specialist dialects, where the plot keeps changing based on decisions made in earlier chapters.

That is precisely the setting in which the new study, led by Mirco J. Friedrich and colleagues, chose to test a locally deployable AI agent: one designed to run inside a hospital's own infrastructure rather than be piped through an external API, and grounded in the actual case record rather than general medical knowledge alone.

Retrospective is easy. External and prospective are the real test.

Here is the detail that separates this paper from the flood of "AI matches doctors" headlines that have become almost a genre unto themselves: the agent wasn't just checked against old, already-decided cases at the hospital where it was built. It was validated three ways — retrospectively, externally, and prospectively — and it is the second and third of those that matter most.

Retrospective validation means testing a model against historical cases whose outcomes are already known. It's useful, but it's also the easiest bar to clear, because the model, the researchers, and sometimes the training data itself all exist downstream of that history. External validation means testing the same model on cases from a different institution — one whose patient population, documentation habits, lab equipment, and clinical culture the model has never seen. Prospective validation is the rarest and hardest: testing the model on cases that have not yet been decided, in real time, so that its recommendation is locked in before the human tumor board deliberates and reaches its own conclusion. There is no way to retrofit a prospective study after the fact. The model has to commit to an answer before anyone in the room knows what the "right" answer will turn out to be.
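The lock-in requirement can be made concrete with a small sketch. This is an illustration only, not the study's actual protocol: the names `LockedRecommendation`, `lock_recommendation`, and `is_prospective`, along with the case ID, recommendation text, and timestamps, are all hypothetical. The idea is to store a SHA-256 hash of the agent's recommendation together with a commit time, then check later that the text is unchanged and that the commit happened before the board decided.

```python
import hashlib
from dataclasses import dataclass
from datetime import datetime, timezone


@dataclass(frozen=True)
class LockedRecommendation:
    """Hypothetical record of an AI recommendation committed ahead of time."""
    case_id: str
    digest: str          # SHA-256 hex digest of the recommendation text
    locked_at: datetime  # when the agent committed to its answer


def lock_recommendation(case_id: str, text: str, locked_at: datetime) -> LockedRecommendation:
    """Commit to a recommendation by storing its hash and a timestamp."""
    digest = hashlib.sha256(text.encode("utf-8")).hexdigest()
    return LockedRecommendation(case_id, digest, locked_at)


def is_prospective(rec: LockedRecommendation, text: str, board_decided_at: datetime) -> bool:
    """True only if the text is unchanged and was locked before the board decided."""
    unchanged = hashlib.sha256(text.encode("utf-8")).hexdigest() == rec.digest
    return unchanged and rec.locked_at < board_decided_at


# Illustrative values only; no real case data.
locked = lock_recommendation(
    "case-001",
    "hypothetical recommendation text",
    datetime(2026, 1, 5, 8, 0, tzinfo=timezone.utc),
)
board_time = datetime(2026, 1, 5, 14, 0, tzinfo=timezone.utc)

print(is_prospective(locked, "hypothetical recommendation text", board_time))  # True
print(is_prospective(locked, "edited after the meeting", board_time))          # False
```

Storing a hash rather than relying on trust means any edit made after the board meets is detectable, which is the property that separates a prospective comparison from a retrospective one.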

Running all three checks on the same system, across multiple hospitals, is uncommon enough in clinical AI that it's worth pausing on. Most published studies — including ones that make confident claims about matching physician judgment — never leave the retrospective stage. This one did, and it did so specifically in hematology, a subfield where the case complexity gives a model every opportunity to fail quietly.

The uncomfortable truth this exposes: the overwhelming majority of "AI beats doctors" studies in oncology never test prospectively at all. They benchmark against a static, already-adjudicated dataset — which means the model is graded on an exam it can't actually fail in a way that matters. A prospective, external evaluation is closer to putting the system on call.

Context: a validation problem this site has already flagged

Regular readers may recall The Validation Gap, which covered a separate Nature Medicine benchmark from earlier in June showing general-purpose chatbots outperforming FDA-cleared clinical AI tools on certain tasks, along with the uncomfortable regulatory question that raised about what "cleared" even certifies. That story was about the gap between regulatory approval and real-world competence. This one is different, though it rhymes: it's about the gap between what most clinical-AI research calls "validated" and what validation would need to mean to trust a system with a real, undecided case. The hematology tumor board study doesn't resolve the FDA-clearance question. But it does offer something the field has been short on: a working example of what clearing the higher bar actually looks like.

87.5% · Tool-use accuracy, Nature Cancer 2025 agent
91.0% · Correct clinical conclusions reached
75.5% · Accurate oncology guideline citation
+11.2% · Longitudinal reasoning gain from agentic tool use, MTBBench

Those first three figures come from a separate, earlier Nature Cancer study (2025) that is worth holding up as a comparison point rather than a duplicate. Researchers built an autonomous AI agent — powered by GPT-4 plus a toolkit of vision transformers and image-segmentation models — and tested it on 20 realistic multimodal oncology cases. The agent used the correct clinical tool at the right moment 87.5% of the time, reached the correct overall clinical conclusion in 91.0% of cases, and correctly cited the relevant oncology guideline 75.5% of the time. For context, GPT-4 alone, without the tool-using agent scaffold, scored roughly 30% on comparable decision accuracy — jumping to around 87% once wrapped in an agentic framework that could call the right specialist tools. That gap, between a raw model and the same model wired into a workflow, is a big part of why "agentic" has become the operative word in this field: the intelligence isn't only in the underlying model, it's in the scaffolding that tells it when to look something up, when to defer, and when to check its own work.

The skeptic's rebuttal: MTBBench and the hallucination problem

None of this should read as an all-clear for AI oncology. The most sobering counterweight arrived via MTBBench, a 2025–2026 benchmark built specifically to simulate molecular-tumor-board-style reasoning across genuinely difficult cases — the kind involving imaging, lab values, pathology slides, genomic panels, and free text, all unfolding across multiple points in time, sometimes with data that contradicts itself the way real medical records do. MTBBench's dataset spans 66 patient cases and 573 expert-validated question-answer pairs, and its authors were blunt about what they found: even large, capable models are, in their words, unreliable. They hallucinate frequently. They struggle to reason across time-resolved data. They fumble when evidence from different modalities doesn't neatly agree. In some of the benchmark's toughest tasks, baseline agents fabricated file names that didn't exist, failed to retrieve genuinely new critical documents, or kept reusing stale context from earlier in a case — a failure mode that shows up most often in exactly the tasks a hematology tumor board cares about most: predicting outcomes and catching recurrence.

There is a silver lining buried in that same paper. When MTBBench's authors wrapped baseline models in an agentic framework equipped with foundation-model-based tools — giving the system structured ways to query images, retrieve records, and cross-check claims rather than relying on free-form generation — performance improved substantially: up to a 9.0% gain on multimodal reasoning tasks and up to 11.2% on longitudinal, time-resolved reasoning. That's the same underlying lesson as the Nature Cancer figures above: bare language models are not the story. Language models embedded in carefully engineered agent architectures, with tools, retrieval, and guardrails, are a meaningfully different and more capable story — though evidently still an incomplete one.
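One of the failure modes described above, citing files that don't exist or ignoring newly added documents, is the kind of thing a simple guardrail can catch before a recommendation reaches a clinician. The sketch below is a hypothetical illustration and not MTBBench's method or any published agent's implementation: the function name `check_grounding` and all file names are invented for the example. It compares the files an agent claims to have used against the files actually present in the case record.

```python
def check_grounding(cited: list[str], available: set[str], newly_added: set[str]) -> dict[str, list[str]]:
    """Flag cited files that are not in the record, and new files the agent never cited.

    cited:       file names the agent says it used (hypothetical agent output)
    available:   file names that actually exist in the case record
    newly_added: files added since the previous time point
    """
    cited_set = set(cited)
    return {
        "fabricated": sorted(cited_set - available),    # cited but nonexistent
        "missed_new": sorted(newly_added - cited_set),  # new but never consulted
    }


# Illustrative file names only.
available = {"marrow_biopsy_t0.pdf", "flow_cytometry_t0.pdf", "ngs_panel_t1.pdf"}
newly_added = {"ngs_panel_t1.pdf"}

report = check_grounding(
    ["marrow_biopsy_t0.pdf", "cytogenetics_t1.pdf"],
    available,
    newly_added,
)
print(report)
# {'fabricated': ['cytogenetics_t1.pdf'], 'missed_new': ['ngs_panel_t1.pdf']}
```

A check like this only verifies that references point at real documents. It says nothing about whether the agent read them correctly, which is why it complements rather than replaces the benchmark-style evaluation described here.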

Two other Nature-family studies published this year make a related point from a different angle: agentic AI systems can meaningfully assist decision-making across the diagnosis, treatment, and hospital-admission stages of patient care. But in both cases, the researchers stopped short of calling either model ready for real-world clinical deployment. Promising performance on a benchmark and readiness to be trusted with an actual patient are not the same claim, and the field's more careful voices have been increasingly insistent about not conflating them.

Current LLMs, even at scale, lack the reliability required for real-world oncology decision support — frequently hallucinating and struggling to reason over time-resolved, multimodal, and sometimes conflicting clinical evidence.

— Paraphrased finding, MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology (arXiv 2511.20490)

So which is it — ready, or not?

Both, in the sense that matters. The Nature Microbiology hematology study and MTBBench are not contradicting each other; they're describing different points on the same map. MTBBench is a stress test built to expose failure across the messiest, most adversarial version of the problem: long time horizons, conflicting modalities, cases explicitly chosen to be hard. The hematology tumor board agent, by contrast, was built and evaluated for a narrower, more disciplined task: producing a case-grounded recommendation that a real hematology team, working from the same chart, would independently reach. It is not a general oncology reasoning engine. It is a specific tool, tested against a specific, well-defined form of agreement, at more than one hospital, on cases that hadn't been decided yet.

That narrowness is not a weakness — it may be the actual lesson. The systems that are approaching genuine clinical readiness in 2026 are not the ones promising to reason about everything. They're the ones with a tightly scoped job description, a validation protocol that includes prospective and external testing as table stakes rather than a stretch goal, and a resulting concordance rate that a hospital's own credentialing committee could actually evaluate on its own terms.

What oncologists and hematologists are actually saying

The published reaction from clinicians to systems like this has been notably more measured than the press coverage. The consistent theme across commentary on both the Nature Cancer 2025 agent and this year's hematology work is that these tools are being framed as decision support, not decision replacement — a second opinion that arrives before the human second opinion, not instead of it. Clinicians interviewed around comparable studies have repeatedly emphasized that a concordance rate, however high, describes agreement with what a board already decided or was about to decide; it does not, by itself, establish that the AI's underlying reasoning was sound, or that it would have caught the one-in-twenty case where the tumor board itself got it wrong. Encouragingly, this is also the argument researchers make for why prospective, external testing matters so much: it is the only design that even gives a wrong human decision the chance to diverge from a right AI one, and vice versa, in a way retrospective benchmarking structurally cannot.

The open questions nobody has fully answered

Even a well-validated narrow success leaves real questions unresolved. Liability sits at the top of the list: if a hospital deploys a locally run agent that participates in a hematology tumor board's deliberation and a patient is harmed, does responsibility sit with the treating physician, the hospital's IT and compliance apparatus, or the model's developers — and does the answer change depending on whether the human team deviated from or followed the AI's suggestion? Over-reliance is the second concern, the one clinicians raise most often in private: a highly concordant tool, used for long enough, risks becoming a tool nobody double-checks, which is precisely the condition under which its failure modes — the same hallucination and reasoning gaps MTBBench documented — become dangerous rather than merely embarrassing. And edge cases remain the hardest problem of all, almost by definition: rare hematologic subtypes, patients with unusual comorbidities, genomic findings that don't fit any existing guideline cleanly, are exactly the cases where concordance with a historical tumor board pattern is least informative, because there may be no clear pattern to match.

Fig. 1 — Two validation tiers in clinical AI research, and where the June 2026 hematology tumor board study sits.

What happens next

If this study represents anything larger than one well-run trial, it's a template. The next wave of clinical-AI papers worth taking seriously will likely be judged less by their headline accuracy number and more by which of these three validation boxes they can honestly check: retrospective, external, prospective. Expect more hospitals to start running locally deployable agents in shadow mode — quietly generating recommendations that clinicians don't see until after they've made their own call — specifically so that prospective, blinded comparisons become routine rather than remarkable. Expect regulators, still digesting the questions raised by the FDA-clearance gap covered elsewhere on this site, to start asking not just "does it work" but "how was it tested, and by whom, and on whose patients." And expect the skeptics behind benchmarks like MTBBench to keep building harder tests, because a field that only measures its successes against yesterday's easy cases will eventually be surprised by tomorrow's hard one. The real second opinion here may not be the AI's. It's the one the research community keeps giving itself, by refusing to grade its own homework.
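For readers who want to see what shadow mode amounts to mechanically, here is a minimal sketch. Everything in it is an assumption made for illustration: the `ShadowLog` class, its method names, and the placeholder regimen labels are hypothetical, and exact string matching stands in for what would in practice be expert adjudication of whether two recommendations agree. The log refuses to reveal the AI's recommendation for a case until the clinician's decision has been recorded, then reports concordance over the cases where both exist.

```python
from typing import Optional


class ShadowLog:
    """Hypothetical blinded log: AI output stays hidden until the clinician has decided."""

    def __init__(self) -> None:
        self._ai: dict[str, str] = {}
        self._clinician: dict[str, str] = {}

    def record_ai(self, case_id: str, recommendation: str) -> None:
        self._ai[case_id] = recommendation

    def record_clinician(self, case_id: str, decision: str) -> None:
        self._clinician[case_id] = decision

    def reveal(self, case_id: str) -> str:
        """Return the AI recommendation only after the clinician decision exists."""
        if case_id not in self._clinician:
            raise PermissionError("clinician decision not recorded yet")
        return self._ai[case_id]

    def concordance(self) -> Optional[float]:
        """Fraction of decided cases where AI and clinician match exactly."""
        decided = [c for c in self._clinician if c in self._ai]
        if not decided:
            return None
        matches = sum(self._ai[c] == self._clinician[c] for c in decided)
        return matches / len(decided)


log = ShadowLog()
log.record_ai("case-001", "regimen A")
log.record_ai("case-002", "regimen B")
log.record_clinician("case-001", "regimen A")

try:
    log.reveal("case-002")
except PermissionError:
    print("case-002 is still blinded")

log.record_clinician("case-002", "regimen C")
print(log.reveal("case-002"))  # regimen B
print(log.concordance())       # 0.5
```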

Sources

Friedrich, M. J. et al. "A locally deployable, case-grounded large language model agent for hematology tumor boards." Nature Microbiology, June 30, 2026. nature.com/nmicrobiol
MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology. arXiv:2511.20490. arxiv.org/abs/2511.20490
MTBBench full text (HTML). arxiv.org/html/2511.20490v1
MTBBench PDF. arxiv.org/pdf/2511.20490
MTBBench summary, Emergent Mind. emergentmind.com/topics/mtbbench
Ferber, D. et al. "Development and validation of an autonomous artificial intelligence agent for clinical decision-making in oncology." Nature Cancer, 2025. nature.com/articles/s43018-025-00991-6
Ferber et al., PMC full text. pmc.ncbi.nlm.nih.gov/articles/PMC12380607
"Autonomous AI agent can support clinical decisions in oncology." Medical Xpress, 2025 news coverage of the Nature Cancer study. medicalxpress.com
Discoveries in Health Policy, commentary on Ferber et al. discoveriesinhealthpolicy.com
"The AI paradox in precision oncology: Prospective blinded validation of large language models against molecular tumor board." Journal of Clinical Oncology, 2026. ascopubs.org
"Clinical evaluation of large language model recommendations in melanoma: comparison with multidisciplinary tumor board decisions in a real-world cohort." Frontiers in Oncology, 2026. frontiersin.org
"Tumor Board-Inspired Multiagent Artificial Intelligence System for Interpreting Oncology Guidelines." PubMed, 2026. pubmed.ncbi.nlm.nih.gov/41499718
"Artificial intelligence agents in cancer research and oncology." PubMed, 2026. pubmed.ncbi.nlm.nih.gov/41526721
Hematology and Immune Engineering, German Cancer Research Center (DKFZ) — research group context. dkfz.de/en/hematology-and-immune-engineering
Nature Medicine's June 2026 benchmark study on general-purpose LLMs vs. FDA-cleared clinical AI (context for "The Validation Gap"). Clinical Trial Vanguard. clinicaltrialvanguard.com
