MTBBench: A Multimodal Sequential Clinical Decision-Making Benchmark in Oncology

Kiril Vasilev, Alexandre Misrahi, Eeshaan Jain, Phil F Cheng, Petros Liakopoulos, Olivier Michielin, Michael Moor, Charlotte Bunne

TL;DR

MTBBench is a multimodal, longitudinal oncology benchmark that simulates Molecular Tumor Board (MTB) decision-making, with ground-truth annotations validated by clinicians. It introduces an agentic framework that lets models iteratively retrieve and reason over heterogeneous patient data, using both foundation-model tools and external knowledge bases (PubMed, DrugBank) to answer complex, time-evolving clinical questions. Empirical evaluation shows that baseline LLMs struggle with reliability and with temporal and cross-modal reasoning, while tool-augmented agents achieve task-level gains of up to 9.0% on multi-modal and 11.2% on longitudinal reasoning by drawing on domain-specific models and information sources. The work contributes a dataset and tool infrastructure for rigorous assessment of multimodal, longitudinal clinical reasoning, and highlights directions for integrating medical foundation models into realistic MTB workflows.

Abstract

Multimodal Large Language Models (LLMs) hold promise for biomedical reasoning, but current benchmarks fail to capture the complexity of real-world clinical workflows. Existing evaluations primarily assess unimodal, decontextualized question-answering, overlooking multi-agent decision-making environments such as Molecular Tumor Boards (MTBs). MTBs bring together diverse experts in oncology, where diagnostic and prognostic tasks require integrating heterogeneous data and evolving insights over time. Current benchmarks lack this longitudinal and multimodal complexity. We introduce MTBBench, an agentic benchmark simulating MTB-style decision-making through clinically challenging, multimodal, and longitudinal oncology questions. Ground truth annotations are validated by clinicians via a co-developed app, ensuring clinical relevance. We benchmark multiple open- and closed-source LLMs and show that, even at scale, they lack reliability: they frequently hallucinate, struggle to reason over time-resolved data, and fail to reconcile conflicting evidence or different modalities. To address these limitations, MTBBench goes beyond benchmarking by providing an agentic framework with foundation model-based tools that enhance multi-modal and longitudinal reasoning, leading to task-level performance gains of up to 9.0% and 11.2%, respectively. Overall, MTBBench offers a challenging and realistic testbed for advancing multimodal LLM reasoning, reliability, and tool use with a focus on MTB environments in precision oncology.

Paper Structure

This paper contains 73 sections, 10 figures, 3 tables.

Figures (10)

  • Figure 1: The MTBBench benchmark and agent framework. a. MTBBench simulates molecular tumor board workflows, presenting agents with longitudinal, multi-modal patient data (H&E, IHC, hematology, and genomics) along with temporally distributed clinical events. Agents are tasked with integrating this information to support complex decision-making. b. MTBBench allows benchmarking agents on their ability to reason across modalities and time in order to accurately tackle clinical questions concerning diagnosis, prognosis, and biomarker interpretation. Lastly, we introduce an agentic framework that enables querying both external tools and pretrained foundation models, allowing agents to more effectively reason over complex, multi-modal and temporally resolved clinical information.
  • Figure 2: Accuracy vs. average number of files accessed per question, analyzed across tasks for multi-modal understanding (a–c) and longitudinal reasoning (d–f). Each point represents a model evaluated on a specific task across all patients. Dot size indicates model size (gpt-4o's size has been reduced for visibility). Higher file access generally correlates with increased accuracy, highlighting the importance of cross-modality and temporal integration for performance.
  • Figure 3: Accuracy across models and tasks for naive and tool-augmented agents, for multi-modal (a–c) and longitudinal (d–f) evaluation. Models equipped with tool access (hatched bars) generally show improved accuracy, highlighting the benefit of querying external resources in both multi-modal and temporal settings.
  • Figure 4: Overview of the Molecular Tumor Board process. Meetings are held at regular intervals or on demand, bringing together multidisciplinary experts who jointly review patient history, molecular profiling results, and clinical evidence to recommend personalized treatment strategies. Final decisions are communicated in writing to the treating physician. Figure adapted from Tsimberidou2023.
  • Figure 5: Companion app interface for clinical validation. The platform displays clinical context, reference images grouped by region and marker, and multiple-choice questions for expert review. Full-resolution slide viewers and inline feedback fields allow for efficient validation of benchmark items.
  • ...and 5 more figures
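
The two quantities plotted in Figure 2, accuracy and average number of files accessed per question, can be derived from per-question evaluation logs. The sketch below is only an illustration: the `QuestionLog` record, its field names, and the `summarize` helper are hypothetical and are not taken from the MTBBench code base.

```python
from dataclasses import dataclass


@dataclass
class QuestionLog:
    """One answered question (hypothetical log format, assumed for illustration)."""
    model: str           # model identifier
    task: str            # task name
    files_accessed: int  # number of patient files the agent opened for this question
    correct: bool        # whether the final answer matched the ground truth


def summarize(logs):
    """Return accuracy and mean files accessed for each (model, task) pair."""
    groups = {}
    for log in logs:
        groups.setdefault((log.model, log.task), []).append(log)

    summary = {}
    for key, items in groups.items():
        n = len(items)
        summary[key] = {
            "accuracy": sum(item.correct for item in items) / n,
            "avg_files": sum(item.files_accessed for item in items) / n,
        }
    return summary


if __name__ == "__main__":
    # Toy records with made-up values, used only to exercise the function.
    demo = [
        QuestionLog("model-a", "task-1", files_accessed=2, correct=True),
        QuestionLog("model-a", "task-1", files_accessed=4, correct=False),
    ]
    print(summarize(demo))
```

Each (model, task) entry in the returned dictionary corresponds to one point in a plot of this kind, with `avg_files` on the horizontal axis and `accuracy` on the vertical axis.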