Uncovering Modality Discrepancy and Generalization Illusion for General-Purpose 3D Medical Segmentation

Yichi Zhang, Feiyang Xiao, Le Xue, Wenbo Zhang, Gang Feng, Chenguang Zheng, Yuan Qi, Yuan Cheng, Zixin Hu

TL;DR

The paper addresses the concern that 3D medical foundation models validated on structural imaging may not generalize to functional modalities such as PET. It introduces an intra-subject, paired benchmark (UMD) with 490 PET/CT and 464 PET/MRI scans across 13 organs, enabling direct cross-modality evaluation without confounding morphological changes. The authors systematically compare five state-of-the-art 3D segmentation approaches under zero-shot prompting, revealing a severe modality discrepancy and a generalization illusion in which high benchmark scores fail to predict real-world performance. These findings underscore the need for multimodal training and evaluation in medical foundation models, and the UMD dataset provides a resource to guide the development of truly modality-agnostic clinical tools.

Abstract

While emerging 3D medical foundation models are envisioned as versatile tools that offer general-purpose capabilities, their validation remains largely confined to regional and structural imaging, leaving a significant modality discrepancy unexplored. To provide a rigorous and objective assessment, we curate the UMD dataset comprising 490 whole-body PET/CT and 464 whole-body PET/MRI scans ($\sim$675k 2D images, $\sim$12k 3D organ annotations) and conduct a comprehensive evaluation of representative 3D segmentation foundation models. Through intra-subject controlled comparisons of paired scans, we isolate imaging modality as the primary independent variable to evaluate model robustness in real-world applications. Our evaluation reveals a stark discrepancy between literature-reported benchmarks and real-world efficacy, particularly when transitioning from structural to functional domains. Such systemic failures underscore that current 3D foundation models are far from achieving truly general-purpose status, necessitating a paradigm shift toward multi-modal training and evaluation to bridge the gap between idealized benchmarking and comprehensive clinical utility. This dataset and analysis establish a foundation for future research to develop truly modality-agnostic medical foundation models.

Paper Structure (5 sections, 4 figures, 3 tables)

Figures (4)

  • Figure 1: The structural bias in data distribution of general-purpose medical segmentation foundation models. A profound disparity is observed between structural imaging (CT and MRI) and functional imaging (PET), with the latter constituting a negligible fraction of the total data across all models.
  • Figure 2: (a) Existing validation protocols typically assess models on heterogeneous datasets where modality is intrinsically entangled with specific anatomical tasks. This approach prevents an isolated measurement of modality-specific robustness, as performance variations are confounded by varying task complexities. (b) In contrast, our evaluation utilizes paired whole-body PET/CT and PET/MRI data. By performing simultaneous segmentation of identical targets across paired structural and functional modalities, we effectively isolate the Modality Discrepancy, enabling a direct and rigorous quantification.
  • Figure 3: Visual comparison of failure cases of SegVol with performance across in-domain and out-of-domain UMD datasets. This starkly illustrates the generalization illusion, where high scores on benchmark datasets fail to translate into robust clinical utility on unseen distributions.
  • Figure 4: Qualitative assessment of modality discrepancy via nnInteractive underscores a fundamental performance gap. Despite exhibiting the highest relative generalization ability in our evaluation, the model remains incapable of mapping anatomical priors onto the distinct signal profiles of metabolic imaging.
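
The paired protocol in Figure 2(b) can be illustrated with a small sketch. The summary above does not name the overlap metric or specify how the discrepancy is computed, so the sketch makes two assumptions: the Dice coefficient is used as the score, and a single reference mask is shared by the co-registered structural and functional scans of one subject. The helper names `dice` and `modality_gap` are hypothetical and are not taken from the paper.

```python
from itertools import product


def dice(pred, gt):
    """Dice overlap between two masks given as sets of voxel coordinates.

    Returns 1.0 when both masks are empty (assumed convention).
    """
    total = len(pred) + len(gt)
    if total == 0:
        return 1.0
    return 2.0 * len(pred & gt) / total


def modality_gap(pred_structural, pred_functional, gt):
    """Dice gap for one subject and one organ: structural minus functional.

    Assumes both predictions are scored against the same reference mask,
    so the imaging modality is the only thing that differs.
    """
    return dice(pred_structural, gt) - dice(pred_functional, gt)


# Toy organ: a 2x2x2 cube of voxels (8 voxels) as the shared reference mask.
gt = set(product(range(1, 3), range(1, 3), range(1, 3)))

# Hypothetical predictions: perfect on the structural scan,
# half of the organ missed on the functional scan.
pred_ct = set(gt)
pred_pet = set(product(range(1, 3), range(1, 3), range(1, 2)))

gap = modality_gap(pred_ct, pred_pet, gt)
print(f"Dice (structural): {dice(pred_ct, gt):.3f}")
print(f"Dice (functional): {dice(pred_pet, gt):.3f}")
print(f"Modality gap:      {gap:.3f}")
```

Because both predictions are compared with the same target in the same subject, a positive gap reflects the change of modality rather than differences in anatomy or task difficulty. Averaging such gaps over subjects and organs would give one possible summary of the modality discrepancy.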