BenchMD: A Benchmark for Unified Learning on Medical Images and Sensors

Kathryn Wantlin, Chenwei Wu, Shih-Cheng Huang, Oishi Banerjee, Farah Dadabhoy, Veeral Vipin Mehta, Ryan Wonhee Han, Fang Cao, Raja R. Narayan, Errol Colak, Adewole Adamson, Laura Heacock, Geoffrey H. Tison, Alex Tamkin, Pranav Rajpurkar

TL;DR

BenchMD is a benchmark for unified learning across medical modalities. It aggregates 19 public datasets spanning 7 modalities and evaluates modality-agnostic architectures and training strategies under few-shot and distribution-shift conditions. Using a transformer-based, modality-agnostic encoder, it compares self-supervised learning (SSL) methods, ImageNet pretraining, and training from scratch, with in-distribution (ID) and zero-shot out-of-distribution (OOD) evaluations on 1D, 2D, and 3D medical data. No single approach performs best across all modalities: ImageNet pretraining excels on several 2D tasks, while SSL methods show modality-specific strengths and limitations, and results are sensitive to label availability and distribution shift. By providing standardized preprocessing and a public codebase, BenchMD aims to accelerate the development of universal, generalizable medical AI models and to support fair cross-modal comparisons.

Abstract

Medical data poses a daunting challenge for AI algorithms: it exists in many different modalities, experiences frequent distribution shifts, and suffers from a scarcity of examples and labels. Recent advances, including transformers and self-supervised learning, promise a more universal approach that can be applied flexibly across these diverse conditions. To measure and drive progress in this direction, we present BenchMD: a benchmark that tests how well unified, modality-agnostic methods, including architectures and training techniques (e.g., self-supervised learning, ImageNet pretraining), perform on a diverse array of clinically relevant medical tasks. BenchMD combines 19 publicly available datasets for 7 medical modalities, including 1D sensor data, 2D images, and 3D volumetric scans. Our benchmark reflects real-world data constraints by evaluating methods across a range of dataset sizes, including challenging few-shot settings that incentivize the use of pretraining. Finally, we evaluate performance on out-of-distribution data collected at hospitals other than those that supplied the training data, representing naturally occurring distribution shifts that frequently degrade the performance of medical AI models. Our baseline results demonstrate that no unified learning technique achieves strong performance across all modalities, leaving ample room for improvement on the benchmark. Code is released at https://github.com/rajpurkarlab/BenchMD.

Paper Structure (44 sections, 3 figures, 13 tables)

Figures (3)

  • Figure 1: The BenchMD benchmark consists of 19 real-world medical datasets across 7 medical modalities. Successful methods will achieve high performance when evaluated on out-of-distribution data.
  • Figure 2: Models for each modality are first trained on a source dataset, using unified methods across modalities. They are then evaluated on out-of-distribution data from one or more target datasets.
  • Figure 3: The in-distribution and out-of-distribution performance of models across modalities. OOD performance is averaged across target dataset(s).
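The evaluation protocol described in the captions for Figures 2 and 3 (train on a source dataset, then score the same model without further tuning on one or more target datasets and average the OOD results) can be illustrated with a minimal sketch. Everything here is an assumption made for illustration: the function names, the toy `(feature, label)` examples, the threshold classifier, and the use of accuracy as the metric are hypothetical and are not taken from the BenchMD codebase.

```python
from statistics import mean
from typing import Callable, Dict, Sequence, Tuple

# Toy stand-in for a real medical example: (feature, label). Hypothetical.
Example = Tuple[float, int]


def accuracy(predict: Callable[[float], int], data: Sequence[Example]) -> float:
    """Fraction of examples whose predicted label matches the true label."""
    return sum(predict(x) == y for x, y in data) / len(data)


def evaluate_id_and_ood(
    predict: Callable[[float], int],
    source_test: Sequence[Example],
    targets: Dict[str, Sequence[Example]],
) -> Dict[str, object]:
    """Score a fixed model on the source test split (ID) and on each target
    dataset (OOD), then average the OOD scores across target datasets."""
    per_target = {name: accuracy(predict, data) for name, data in targets.items()}
    return {
        "id": accuracy(predict, source_test),
        "ood_per_target": per_target,
        "ood_mean": mean(per_target.values()),
    }


# Hypothetical model: a fixed threshold standing in for a trained classifier.
def predict(x: float) -> int:
    return int(x > 0.5)


source_test = [(0.9, 1), (0.1, 0), (0.7, 1), (0.2, 0)]
targets = {
    "hospital_b": [(0.6, 1), (0.4, 1), (0.3, 0), (0.8, 1)],
    "hospital_c": [(0.55, 1), (0.45, 1)],
}

report = evaluate_id_and_ood(predict, source_test, targets)
print(report)
```

Averaging per-target scores, rather than pooling all target examples, gives each target dataset equal weight regardless of its size, which matches the caption's wording that OOD performance is averaged across target dataset(s).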