MedDINOv3: How to adapt vision foundation models for medical image segmentation?
Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang
TL;DR
MedDINOv3 demonstrates that vision foundation models pretrained on natural images can be effectively adapted for medical image segmentation by combining a simple ViT-based backbone, which uses multi-scale token aggregation and high-resolution training, with three-stage domain-adaptive pretraining on CT data. The three stages are global/local self-distillation, gram anchoring, and high-resolution adaptation, all run on the large CT-3M dataset; ablations show that Stage 1 and Stage 3 drive substantial gains while Stage 2 is optional. Across four public CT/MRI benchmarks, MedDINOv3 matches or surpasses state-of-the-art methods and often outperforms the strong nnU-Net baseline, illustrating strong transferability once domain alignment is performed. These results indicate that carefully designed architectural refinements plus domain-aligned pretraining can enable vision foundation models to serve as unified backbones for medical image segmentation, with practical implications for cross-institution and cross-modality radiology workflows.
Abstract
Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific and generalize poorly across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbones of most foundation models still underperform specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design an architecture built on multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.
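To make the idea of multi-scale token aggregation concrete, the following is a minimal NumPy sketch, not the released implementation. It assumes that patch tokens from several intermediate ViT blocks are concatenated along the channel axis and reshaped into a dense feature map for a segmentation decoder. The function name `aggregate_multiscale_tokens`, the block indices (2, 5, 8, 11), and the toy dimensions are illustrative assumptions; the abstract does not specify them.

```python
import numpy as np


def aggregate_multiscale_tokens(layer_tokens, layer_ids, grid_hw):
    """Concatenate patch tokens from selected ViT blocks into one feature map.

    layer_tokens: list of arrays, one per transformer block,
                  each of shape (num_patches, dim).
    layer_ids:    indices of the blocks to aggregate (hypothetical choice).
    grid_hw:      (H, W) patch grid with H * W == num_patches.
    Returns an array of shape (H, W, dim * len(layer_ids)).
    """
    h, w = grid_hw
    selected = [layer_tokens[i] for i in layer_ids]
    for tokens in selected:
        if tokens.shape[0] != h * w:
            raise ValueError("token count does not match the patch grid")
    # Channel-wise concatenation keeps one fused vector per patch location.
    fused = np.concatenate(selected, axis=-1)
    # Row-major reshape restores the 2D spatial layout of the patches.
    return fused.reshape(h, w, -1)


# Toy example: 12 blocks, a 4x4 patch grid, 8-dimensional tokens.
rng = np.random.default_rng(0)
num_blocks, dim, grid_hw = 12, 8, (4, 4)
layer_tokens = [
    rng.standard_normal((grid_hw[0] * grid_hw[1], dim))
    for _ in range(num_blocks)
]
feature_map = aggregate_multiscale_tokens(
    layer_tokens, layer_ids=(2, 5, 8, 11), grid_hw=grid_hw
)
print(feature_map.shape)  # (4, 4, 32)
```

In this sketch, shallow blocks contribute lower-level detail and deep blocks contribute more semantic features, so each spatial location carries information from several depths. The actual layer selection and fusion used by MedDINOv3 should be checked against the repository linked above.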
