Uncovering Modality Discrepancy and Generalization Illusion for General-Purpose 3D Medical Segmentation

Yichi Zhang; Feiyang Xiao; Le Xue; Wenbo Zhang; Gang Feng; Chenguang Zheng; Yuan Qi; Yuan Cheng; Zixin Hu

TL;DR

The paper addresses a gap in current validation practice: 3D medical foundation models validated on structural imaging may not generalize to functional modalities such as PET. It introduces an intra-subject, paired benchmark (UMD) with 490 PET/CT and 464 PET/MRI scans across 13 organs, enabling direct cross-modality evaluation without confounding morphological changes. The authors systematically compare five state-of-the-art 3D segmentation approaches under zero-shot prompting, revealing a severe modality discrepancy and a generalization illusion in which high benchmark scores fail to predict real-world performance. These findings underscore the need for multimodal training and evaluation in medical foundation models, and the UMD dataset provides a foundational resource to guide the development of truly modality-agnostic clinical tools.

Abstract

While emerging 3D medical foundation models are envisioned as versatile tools that offer general-purpose capabilities, their validation remains largely confined to regional and structural imaging, leaving a significant modality discrepancy unexplored. To provide a rigorous and objective assessment, we curate the UMD dataset comprising 490 whole-body PET/CT and 464 whole-body PET/MRI scans ($\sim$675k 2D images, $\sim$12k 3D organ annotations) and conduct a comprehensive evaluation of representative 3D segmentation foundation models. Through intra-subject controlled comparisons of paired scans, we isolate imaging modality as the primary independent variable to evaluate model robustness in real-world applications. Our evaluation reveals a stark discrepancy between literature-reported benchmarks and real-world efficacy, particularly when transitioning from structural to functional domains. Such systemic failures underscore that current 3D foundation models are far from achieving truly general-purpose status, necessitating a paradigm shift toward multi-modal training and evaluation to bridge the gap between idealized benchmarking and comprehensive clinical utility. This dataset and analysis establish a foundational cornerstone for future research to develop truly modality-agnostic medical foundation models.

Paper Structure (5 sections, 4 figures, 3 tables)

Introduction
Validation Pitfalls
Benchmark Design
Results and Discussion
Conclusion

Figures (4)

Figure 1: The structural bias in data distribution of general-purpose medical segmentation foundation models. A profound disparity is observed between structural imaging (CT and MRI) and functional imaging (PET), with the latter constituting a negligible fraction of the total data across all models.
Figure 2: (a) Existing validation protocols typically assess models on heterogeneous datasets where modality is intrinsically entangled with specific anatomical tasks. This approach prevents an isolated measurement of modality-specific robustness, as performance variations are confounded by varying task complexities. (b) In contrast, our evaluation utilizes paired whole-body PET/CT and PET/MRI data. By performing simultaneous segmentation of identical targets across paired structural and functional modalities, we effectively isolate the Modality Discrepancy, enabling a direct and rigorous quantification.
Figure 3: Visual comparison of SegVol failure cases alongside its performance across in-domain and out-of-domain UMD datasets. This starkly illustrates the generalization illusion, where high scores on benchmark datasets fail to translate into robust clinical utility on unseen distributions.
Figure 4: Qualitative assessment of modality discrepancy via nnInteractive underscores a fundamental performance gap. Despite exhibiting the highest relative generalization ability in our evaluation, the model remains incapable of mapping anatomical priors onto the distinct signal profiles of metabolic imaging.
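
The paired protocol described in Figure 2(b) can be sketched in a few lines. This is a minimal illustration, not the authors' evaluation code. It assumes the Dice coefficient as the overlap metric (the text above does not name one), represents each mask as a set of voxel indices, and uses hypothetical helper names (`dice`, `modality_gap`, `cube`) and toy masks. Because both scans come from the same subject, a single annotation serves as ground truth for both modalities, so the difference in scores reflects the modality alone.

```python
from itertools import product


def dice(pred, gt):
    """Dice overlap between two masks given as sets of (z, y, x) voxel indices."""
    total = len(pred) + len(gt)
    if total == 0:
        # Both masks empty: treated as perfect agreement (an assumed convention).
        return 1.0
    return 2.0 * len(pred & gt) / total


def modality_gap(pred_structural, pred_functional, gt):
    """Score one organ on a paired scan and return (structural, functional, gap).

    The same ground-truth mask is used for both predictions, which is what
    the intra-subject pairing makes possible.
    """
    d_structural = dice(pred_structural, gt)
    d_functional = dice(pred_functional, gt)
    return d_structural, d_functional, d_structural - d_functional


def cube(z0, z1, y0, y1, x0, x1):
    """Toy helper: all voxel indices inside an axis-aligned box (end-exclusive)."""
    return set(product(range(z0, z1), range(y0, y1), range(x0, x1)))


# Toy data (hypothetical): a 2x2x2 organ, a perfect prediction on CT,
# and a prediction on PET that is shifted by one voxel along z.
gt = cube(1, 3, 1, 3, 1, 3)
pred_ct = cube(1, 3, 1, 3, 1, 3)
pred_pet = cube(2, 4, 1, 3, 1, 3)

d_ct, d_pet, gap = modality_gap(pred_ct, pred_pet, gt)
print(d_ct, d_pet, gap)
```

In a real evaluation the per-organ gaps would be aggregated over all subjects and organs; the aggregation scheme is left open here because the text above does not specify it.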
