Designing UNICORN: a Unified Benchmark for Imaging in Computational Pathology, Radiology, and Natural Language

Michelle Stegeman, Lena Philipp, Fennie van der Graaf, Marina D'Amato, Clément Grisi, Luc Builtjes, Joeran S. Bosma, Judith Lefkes, Rianne A. Weber, James A. Meakin, Thomas Koopman, Anne Mickan, Mathias Prokop, Ewoud J. Smit, Geert Litjens, Jeroen van der Laak, Bram van Ginneken, Maarten de Rooij, Henkjan Huisman, Colin Jacobs, Francesco Ciompi, Alessa Hering

TL;DR

UNICORN is a public benchmark that systematically evaluates medical foundation models under a unified protocol. It standardizes multi-task, multi-modality assessment and establishes a foundation for reproducible benchmarking of medical foundation models.

Abstract

Medical foundation models show promise to learn broadly generalizable features from large, diverse datasets. This could be the base for reliable cross-modality generalization and rapid adaptation to new, task-specific goals, with only a few task-specific examples. Yet, evidence for this is limited by the lack of public, standardized, and reproducible evaluation frameworks, as existing public benchmarks are often fragmented across task-, organ-, or modality-specific settings, limiting assessment of cross-task generalization. We introduce UNICORN, a public benchmark designed to systematically evaluate medical foundation models under a unified protocol. To isolate representation quality, we built the benchmark on a novel two-step framework that decouples model inference from task-specific evaluation based on standardized few-shot adaptation. As a central design choice, we constructed indirectly accessible sequestered test sets derived from clinically relevant cohorts, along with standardized evaluation code and a submission interface on an open benchmarking platform. Performance is aggregated into a single UNICORN Score, a new metric that we introduce to support direct comparison of foundation models across diverse medical domains, modalities, and task types. The UNICORN test dataset includes data from more than 2,400 patients, including over 3,700 vision cases and over 2,400 clinical reports collected from 17 institutions across eight countries. The benchmark spans eight anatomical regions and four imaging modalities. Both task-specific and aggregated leaderboards enable accessible, standardized, and reproducible evaluation. By standardizing multi-task, multi-modality assessment, UNICORN establishes a foundation for reproducible benchmarking of medical foundation models. Data, baseline methods, and the evaluation platform are publicly available via unicorn.grand-challenge.org.

Paper Structure (27 sections, 4 equations, 3 figures, 2 tables)


Figures (3)

  • Figure 1: Overview of the 20 UNICORN benchmark tasks. Of the 20 tasks, 15 focus on eight different anatomical regions, whereas the remaining five are broad-scope tasks that span multiple regions or represent specific medical processes.
  • Figure 2: Reference labels in the UNICORN validation set. For classification tasks (T1, T2, T4, T12-T16), bar charts show the proportion of cases per label. For T16, which contains seven independent binary labels, each bar represents the proportion of reports labeled True; the remainder corresponds to the proportion labeled False. Regression tasks (T3, T17, T18) are summarized with boxplots of target values, and detection tasks (T5-T8) with histograms of object counts. Segmentation tasks (T9-T11) are summarized with bar charts showing either the proportion of each class for multiclass tasks or the proportion of object volume for single-class tasks. For named entity recognition (T19), bars show the distribution of target categories. For caption generation (T20), the distribution of caption lengths is shown; few-shot examples are not available for this task.
  • Figure 3: UNICORN benchmarking pipeline. The pipeline is structured vertically from data storage to metrics reporting, with modality-specific differences shown horizontally for vision, language, and vision-language. Vision tasks: Each case is processed by the Algorithm container to extract generic representations using pre-trained foundation models. These representations are passed to the Evaluation container, where a lightweight adaptor trained on labeled few-shot examples produces task-specific predictions and computes evaluation metrics. Language tasks: All labeled few-shot cases and evaluation cases are provided together to the Algorithm container, which generates predictions that are evaluated in the Evaluation container. Vision-language task: Each case is processed individually by the Algorithm container, which uses the textual task description to generate a textual prediction that is evaluated in the Evaluation container. For all tasks, the resulting metrics are reported on their respective leaderboards on Grand Challenge.
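
The two-step flow described for vision tasks in Figure 3 (frozen feature extraction, then a lightweight adaptor fit on few-shot examples) can be illustrated with a toy sketch. This is not the UNICORN evaluation code: the function names are hypothetical, the "encoder" is a stand-in that only L2-normalizes a vector, and the choice of a cosine-similarity k-nearest-neighbour adaptor with accuracy as the metric is an assumption made for illustration. The source only states that the adaptor is lightweight and trained on labeled few-shot examples.

```python
import math
from collections import Counter


def extract_features(case):
    """Step 1 stand-in (Algorithm container): map a case to an embedding.

    Assumption: a real submission would run a frozen foundation model here.
    In this toy version the case is already a list of floats and is only
    L2-normalized.
    """
    norm = math.sqrt(sum(x * x for x in case)) or 1.0
    return [x / norm for x in case]


def knn_adaptor(shot_features, shot_labels, query_feature, k=3):
    """Step 2 stand-in (Evaluation container): a lightweight adaptor.

    Assumption: k-nearest neighbours by cosine similarity with majority vote.
    Features are unit-length, so the dot product equals cosine similarity.
    """
    scored = [
        (sum(a * b for a, b in zip(feature, query_feature)), label)
        for feature, label in zip(shot_features, shot_labels)
    ]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    votes = Counter(label for _, label in scored[:k])
    return votes.most_common(1)[0][0]


def evaluate(few_shots, test_cases, k=3):
    """Run both steps and return (predictions, accuracy)."""
    shot_features = [extract_features(case) for case, _ in few_shots]
    shot_labels = [label for _, label in few_shots]
    predictions = [
        knn_adaptor(shot_features, shot_labels, extract_features(case), k)
        for case, _ in test_cases
    ]
    correct = sum(pred == label for pred, (_, label) in zip(predictions, test_cases))
    return predictions, correct / len(test_cases)


# Toy data (hypothetical): two well-separated classes in a 2-D feature space.
few_shots = [
    ([1.0, 0.1], "A"), ([0.9, 0.2], "A"), ([0.8, 0.0], "A"),
    ([0.1, 1.0], "B"), ([0.2, 0.9], "B"), ([0.0, 0.7], "B"),
]
test_cases = [([1.0, 0.0], "A"), ([0.0, 1.0], "B")]

predictions, accuracy = evaluate(few_shots, test_cases)
print(predictions, accuracy)
```

The point of the separation is that the encoder never sees task labels: only the small adaptor is fit per task, so differences in the final metric mostly reflect the quality of the extracted representations.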