Uncovering Modality Discrepancy and Generalization Illusion for General-Purpose 3D Medical Segmentation
Yichi Zhang, Feiyang Xiao, Le Xue, Wenbo Zhang, Gang Feng, Chenguang Zheng, Yuan Qi, Yuan Cheng, Zixin Hu
TL;DR
The paper addresses a gap: 3D medical foundation models validated on structural imaging may not generalize to functional modalities such as PET. It introduces an intra-subject, paired benchmark (UMD) with 490 PET/CT and 464 PET/MRI scans across 13 organs, enabling direct cross-modality evaluation without confounding morphological changes. The authors systematically compare five state-of-the-art 3D segmentation approaches under zero-shot prompting, revealing a severe modality discrepancy and a generalization illusion, in which high benchmark scores fail to predict real-world performance. These findings underscore the need for multimodal training and evaluation in medical foundation models, and the UMD dataset provides a resource to guide the development of truly modality-agnostic clinical tools.
Abstract
While emerging 3D medical foundation models are envisioned as versatile tools that offer general-purpose capabilities, their validation remains largely confined to regional and structural imaging, leaving a significant modality discrepancy unexplored. To provide a rigorous and objective assessment, we curate the UMD dataset, comprising 490 whole-body PET/CT and 464 whole-body PET/MRI scans ($\sim$675k 2D images, $\sim$12k 3D organ annotations), and conduct a comprehensive evaluation of representative 3D segmentation foundation models. Through intra-subject controlled comparisons of paired scans, we isolate imaging modality as the primary independent variable to evaluate model robustness in real-world applications. Our evaluation reveals a stark discrepancy between literature-reported benchmarks and real-world efficacy, particularly when transitioning from structural to functional domains. Such systemic failures underscore that current 3D foundation models are far from achieving truly general-purpose status, necessitating a paradigm shift toward multi-modal training and evaluation to bridge the gap between idealized benchmarking and comprehensive clinical utility. This dataset and analysis establish a foundation for future research to develop truly modality-agnostic medical foundation models.
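As an illustration of the intra-subject paired comparison described above, the following minimal sketch scores one model on the structural and the functional scan of the same subject and reports the mean per-subject difference. It rests on several assumptions that the abstract does not state: the Dice coefficient as the overlap metric, a single reference mask shared by both co-registered scans of a pair, and masks stored as flat 0/1 sequences. The names `dice`, `paired_modality_gap`, and the dictionary keys are hypothetical and are not taken from the paper.

```python
from statistics import mean


def dice(pred, ref):
    """Dice overlap between two binary masks given as flat 0/1 sequences."""
    if len(pred) != len(ref):
        raise ValueError("masks must have the same number of voxels")
    intersection = sum(1 for p, r in zip(pred, ref) if p and r)
    total = sum(1 for p in pred if p) + sum(1 for r in ref if r)
    # Convention assumed here: two empty masks count as perfect agreement.
    return 1.0 if total == 0 else 2.0 * intersection / total


def paired_modality_gap(cases):
    """Mean per-subject Dice difference (structural minus functional).

    Each case holds predictions for the same subject and organ, so anatomy
    is held fixed and only the input modality differs.
    """
    gaps = [
        dice(c["pred_structural"], c["ref"]) - dice(c["pred_functional"], c["ref"])
        for c in cases
    ]
    return mean(gaps)


# Toy data (hypothetical): one subject, one organ, eight voxels.
toy_cases = [
    {
        "ref":             [1, 1, 1, 1, 0, 0, 0, 0],
        "pred_structural": [1, 1, 1, 1, 0, 0, 0, 0],  # e.g. prediction on CT
        "pred_functional": [1, 1, 0, 0, 1, 1, 0, 0],  # e.g. prediction on PET
    }
]
gap = paired_modality_gap(toy_cases)
```

Because each difference is computed within a subject, a positive gap reflects the change of modality and not a difference in anatomy between cohorts.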
