M3DA: Benchmark for Unsupervised Domain Adaptation in 3D Medical Image Segmentation
Boris Shirokikh, Anvar Kurmukov, Mariia Donskova, Valentin Samokhin, Mikhail Belyaev, Ivan Oseledets
TL;DR
M3DA addresses the lack of a large, public benchmark for unsupervised domain adaptation (UDA) in 3D medical image segmentation. It turns eight clinically relevant domain shifts, drawn from four public datasets (AMOS, BraTS, CC359, LIDC), into eight tasks across 22 problems. The authors establish a standardized evaluation protocol with an nnU-Net-based baseline and an oracle, and survey over a dozen UDA methods spanning discrepancy-based, self-training, adversarial, image-level, and augmentation strategies, plus foundational backbones. Their experiments show that no method consistently closes the domain gap: the best approaches reduce it by roughly 62% on average. This highlights the need for novel, robust techniques and shows how strongly generic augmentations affect the results. The work further demonstrates that M3DA supports domain adaptation paradigms beyond unsupervised DA (e.g., supervised, source-free, test-time, and domain generalization), offering a public benchmark to accelerate progress toward robust, scalable 3D medical image segmentation in heterogeneous clinical settings.
Abstract
Domain shift presents a significant challenge in applying deep learning to the segmentation of 3D medical images from sources such as Magnetic Resonance Imaging (MRI) and Computed Tomography (CT). Although numerous domain adaptation methods have been developed to address this issue, they are often evaluated under impractical data shift scenarios. Specifically, the medical imaging datasets used are often private, too small for robust training and evaluation, or limited to single or synthetic tasks. To overcome these limitations, we introduce the M3DA /ˈmɛdə/ benchmark, comprising four publicly available, multiclass segmentation datasets. We have designed eight domain pairs featuring diverse and practically relevant distribution shifts. These include inter-modality shifts between MRI and CT, and intra-modality shifts across MRI acquisition parameters, CT radiation doses, and the presence or absence of contrast enhancement. Within the proposed benchmark, we evaluate more than ten existing domain adaptation methods. Our results show that none of them can consistently close the performance gap between the domains. For instance, the most effective method reduces the performance gap by about 62% across the tasks. This highlights the need for novel domain adaptation algorithms that enhance the robustness and scalability of deep learning models in medical imaging. The M3DA benchmark is publicly available at https://github.com/BorisShirokikh/M3DA.
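To make a figure such as "reduces the performance gap by about 62%" concrete, the minimal sketch below assumes that gap reduction is the fraction of the baseline-to-oracle score difference that a method recovers on the target domain. That definition, the function name `gap_reduction`, and the Dice values are illustrative assumptions, not the benchmark's code or its reported results.

```python
def gap_reduction(baseline: float, oracle: float, method: float) -> float:
    """Fraction of the baseline-to-oracle gap recovered by a method.

    Assumed definition (not taken from the benchmark code):
        (method - baseline) / (oracle - baseline)
    where every argument is a target-domain score such as mean Dice.
    """
    gap = oracle - baseline
    if gap <= 0:
        raise ValueError("oracle score must exceed the baseline score")
    return (method - baseline) / gap


# Hypothetical Dice scores chosen for illustration only.
scores = {"baseline": 0.50, "oracle": 0.90, "method": 0.75}
print(f"gap reduction: {gap_reduction(**scores):.1%}")
```

Under this reading, a value of 0 means the adapted model is no better than the source-only baseline, and a value of 1 means it matches the oracle.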
