MedDINOv3: How to adapt vision foundation models for medical image segmentation?

Yuheng Li, Yizhou Wu, Yuxiang Lai, Mingzhe Hu, Xiaofeng Yang

TL;DR

MedDINOv3 demonstrates that vision foundation models pretrained on natural images can be effectively adapted for medical image segmentation through a simple ViT-based backbone with multi-scale token aggregation and high-resolution training, coupled with three-stage domain-adaptive pretraining on CT data. The pretraining combines global/local self-distillation (Stage 1), Gram anchoring (Stage 2), and high-resolution adaptation (Stage 3) on the large CT-3M dataset; ablations show that Stage 1 and Stage 3 drive substantial gains, while Stage 2 is optional. Across four public CT/MRI benchmarks, MedDINOv3 matches or surpasses state-of-the-art methods and often outperforms the strong nnU-Net baseline, illustrating strong transferability when domain alignment is performed. These results indicate that carefully designed architectural refinements plus domain-aligned pretraining can enable vision foundation models to serve as unified backbones for medical image segmentation, with practical implications for cross-institution and cross-modality radiology workflows.

Abstract

Accurate segmentation of organs and tumors in CT and MRI scans is essential for diagnosis, treatment planning, and disease monitoring. While deep learning has advanced automated segmentation, most models remain task-specific, lacking generalizability across modalities and institutions. Vision foundation models (FMs) pretrained on billion-scale natural images offer powerful and transferable representations. However, adapting them to medical imaging faces two key challenges: (1) the ViT backbone of most foundation models still underperforms specialized CNNs on medical image segmentation, and (2) the large domain gap between natural and medical images limits transferability. We introduce MedDINOv3, a simple and effective framework for adapting DINOv3 to medical segmentation. We first revisit plain ViTs and design a simple and effective architecture with multi-scale token aggregation. Then, we perform domain-adaptive pretraining on CT-3M, a curated collection of 3.87M axial CT slices, using a multi-stage DINOv3 recipe to learn robust dense features. MedDINOv3 matches or exceeds state-of-the-art performance across four segmentation benchmarks, demonstrating the potential of vision foundation models as unified backbones for medical image segmentation. The code is available at https://github.com/ricklisz/MedDINOv3.

Paper Structure

This paper contains 24 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: MedDINOv3 PCA maps at progressively higher resolution. We visualize dense features of MedDINOv3 by mapping the first three components of a PCA computed over the feature space to RGB. We mask the feature maps to focus on the CT foreground.
  • Figure 2: Overall framework of MedDINOv3. (a) Stage 1: Given an input CT, we feed the global crops to the teacher model and the local and masked crops to the student. A self-distillation loss is applied to the CLS tokens and a masking loss is applied to the dense patch tokens. (b) Stage 2: Adds Gram anchoring. The Gram teacher sees a higher-resolution global crop and outputs dense feature maps, which are resized to match the student resolution. Stage 3: Both student and teacher are trained with higher-resolution CT inputs (not shown). (c) Finetuning the pretrained MedDINOv3 for segmentation with the proposed architecture.
  • Figure 3: High-resolution dense features of MedDINOv3. We visualize the cosine similarity maps between the patches marked with a red dot and all other patches. Input image at 2048 $\times$ 2048.
  • Figure 4: Evolution of the cosine similarity between the reference patch (marked in red) and all other patches. We did not observe severe patch degradation in stage 1.
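
The two visualizations described in the captions above (PCA-to-RGB maps in Figure 1, reference-patch cosine similarity maps in Figures 3 and 4) can be sketched roughly as follows. This is a minimal NumPy illustration, not the authors' plotting code: the function names `pca_to_rgb` and `cosine_similarity_map`, the `(H, W, D)` feature layout, and the per-channel min-max normalization are all assumptions made for the example, and the random array stands in for real dense patch features.

```python
import numpy as np


def pca_to_rgb(features, mask=None):
    """Map the first three PCA components of dense features to RGB.

    features: (H, W, D) array of patch features (assumed layout, D >= 3).
    mask:     optional (H, W) boolean foreground mask; background stays black.
    """
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(np.float64)
    keep = np.ones(h * w, dtype=bool) if mask is None else mask.reshape(-1)

    # Fit PCA on foreground patches only, so background does not dominate.
    fg = flat[keep]
    centered = fg - fg.mean(axis=0)
    # Rows of vt are the principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    proj = centered @ vt[:3].T  # (N_fg, 3)

    # Min-max normalize each component to [0, 1] (an assumed display choice).
    lo, hi = proj.min(axis=0), proj.max(axis=0)
    proj = (proj - lo) / np.maximum(hi - lo, 1e-12)

    rgb = np.zeros((h * w, 3))
    rgb[keep] = proj
    return rgb.reshape(h, w, 3)


def cosine_similarity_map(features, ref_row, ref_col):
    """Cosine similarity between one reference patch and all other patches."""
    h, w, d = features.shape
    flat = features.reshape(-1, d).astype(np.float64)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.maximum(norms, 1e-12)
    ref = unit[ref_row * w + ref_col]
    return (unit @ ref).reshape(h, w)


# Stand-in data: random features on an 8 x 8 patch grid with 16 channels.
rng = np.random.default_rng(0)
demo_features = rng.normal(size=(8, 8, 16))
demo_mask = np.zeros((8, 8), dtype=bool)
demo_mask[2:6, 2:6] = True  # pretend the CT foreground is the central region

demo_rgb = pca_to_rgb(demo_features, demo_mask)
demo_sim = cosine_similarity_map(demo_features, ref_row=3, ref_col=4)
```

In practice the feature array would come from reshaping the backbone's patch tokens onto the patch grid, and the resulting maps would be upsampled to the input resolution for display.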