MultiMed: Massively Multimodal and Multitask Medical Understanding

Shentong Mo; Paul Pu Liang

TL;DR

MultiMed tackles the scarcity of large-scale, diverse multimodal medical datasets by introducing a benchmark with 2.56 million samples across 10 modalities and 11 tasks to evaluate unimodal, multimodal, and multitask learning. The authors formalize notations and fusion strategies, and demonstrate that multimodal multitask models achieve superior performance and robustness, including zero-shot and few-shot generalization, across a wide range of medical problems. Key contributions include the dataset design with organ/cell, modality, and task diversity; comprehensive experiments showing clear gains from modality integration; and analyses of generalization, robustness, and novel modality combinations with implications for personalized medicine and clinical decision support. The work positions MultiMed as a scalable, community-driven platform for advancing generalist biomedical AI, with attention to potential biases, fairness, and real-world deployment considerations.

Abstract

Biomedical data is inherently multimodal, consisting of electronic health records, medical imaging, digital pathology, genome sequencing, wearable sensors, and more. The application of artificial intelligence tools to these multifaceted sensing technologies has the potential to revolutionize the prognosis, diagnosis, and management of human health and disease. However, current approaches to biomedical AI typically only train and evaluate with one or a small set of medical modalities and tasks. This limitation hampers the development of comprehensive tools that can leverage the rich interconnected information across many heterogeneous biomedical sensors. To address this challenge, we present MultiMed, a benchmark designed to evaluate and enable large-scale learning across a wide spectrum of medical modalities and tasks. MultiMed consists of 2.56 million samples across ten medical modalities such as medical reports, pathology, genomics, and protein data, and is structured into eleven challenging tasks, including disease prognosis, protein structure prediction, and medical question answering. Using MultiMed, we conduct comprehensive experiments benchmarking state-of-the-art unimodal, multimodal, and multitask models. Our analysis highlights the advantages of training large-scale medical models across many related modalities and tasks. Moreover, MultiMed enables studies of generalization across related medical concepts, robustness to real-world noisy data and distribution shifts, and novel modality combinations to improve prediction performance. MultiMed will be publicly available and regularly updated and welcomes inputs from the community.

Paper Structure (41 sections, 7 equations, 7 figures, 2 tables)

Introduction
Related Work
MultiMed: A Massively Multimodal and Multitask Medical Benchmark
Organ & cell type diversity
Modality diversity
Task diversity
Medical AI Methods Benchmarked in MultiMed
Notations
Unimodal single-task and multitask learning
Multimodal fusion methods
Multimodal and multitask learning
Experiments
Experimental setup
Evaluation metrics
Implementation details and computation
...and 26 more sections

Figures (7)

Figure 1: MultiMed is a large-scale benchmark for representation learning in the medical domain, consisting of 2.56M samples, 10 rich modalities, and 11 challenging tasks in real-world medical scenarios. We also present new challenges for impactful applications involving text, OCT, X-ray, CT, MRI, Pathology, EEG, genomics, scRNA-seq, and proteins. The lines represent modality pairings present in individual MultiMed datasets, such as those between text and MRI, text and pathology, as well as text, genomics, and proteins among others.
Figure 2: Plots of organ out-of-distribution (top) and cell out-of-distribution (bottom) results. Our multimodal multi-task models retain strong performance for varying organ and cell distributions.
Figure 3: Visualizations of modality combinations for disease classification, protein structure prediction, and gene expression prediction. For disease classification, the best combination is the three imaging modalities X-ray, CT, and MRI; for protein structure prediction, unimodal protein models are sufficient; and for gene expression prediction, bimodal fusion between DNA and scRNA-seq performs best. These modality combinations were previously unexplored in the literature.
Figure 4: Visualizations of OCT samples.
Figure 5: Visualizations of pathology samples.
...and 2 more figures
