Breaking the Data Barrier: Robust Few-Shot 3D Vessel Segmentation using Foundation Models

Kirato Yoshihara, Yohei Sugawara, Yuta Tokuoka, Lihang Hong

TL;DR

The paper proposes a framework that adapts a pre-trained Vision Foundation Model (DINOv3) to volumetric vessel segmentation. It introduces a lightweight 3D Adapter for volumetric consistency, a multi-scale 3D Aggregator for hierarchical feature fusion, and Z-channel embedding to bridge the gap between 2D pre-training and 3D medical modalities.

Abstract

State-of-the-art vessel segmentation methods typically require large-scale annotated datasets and suffer from severe performance degradation under domain shifts. In clinical practice, however, acquiring extensive annotations for every new scanner or protocol is unfeasible. To address this, we propose a novel framework leveraging a pre-trained Vision Foundation Model (DINOv3) adapted for volumetric vessel segmentation. We introduce a lightweight 3D Adapter for volumetric consistency, a multi-scale 3D Aggregator for hierarchical feature fusion, and Z-channel embedding to effectively bridge the gap between 2D pre-training and 3D medical modalities, enabling the model to capture continuous vascular structures from limited data. We validated our method on the TopCoW (in-domain) and Lausanne (out-of-distribution) datasets. In the extreme few-shot regime with 5 training samples, our method achieved a Dice score of 43.42%, marking a 30% relative improvement over the state-of-the-art nnU-Net (33.41%) and outperforming other Transformer-based baselines, such as SwinUNETR and UNETR, by up to 45%. Furthermore, in the out-of-distribution setting, our model demonstrated superior robustness, achieving a 50% relative improvement over nnU-Net (21.37% vs. 14.22%), which suffered from severe domain overfitting. Ablation studies confirmed that our 3D adaptation mechanism and multi-scale aggregation strategy are critical for vascular continuity and robustness. Our results suggest foundation models offer a viable cold-start solution, improving clinical reliability under data scarcity or domain shifts.
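The relative improvements quoted in the abstract can be checked directly from the reported Dice scores. The sketch below is illustrative only: the helper names `dice_score` and `relative_improvement` are hypothetical, the Dice function uses the standard overlap definition on flat binary masks, and treating two empty masks as a perfect match is an assumed convention, not something the paper states.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as flat sequences of 0/1."""
    intersection = sum(p * t for p, t in zip(pred, target))
    total = sum(pred) + sum(target)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def relative_improvement(ours, baseline):
    """Relative gain of `ours` over `baseline`, in percent."""
    return 100.0 * (ours - baseline) / baseline


# Dice scores (in %) as reported in the abstract.
few_shot_gain = relative_improvement(43.42, 33.41)  # TopCoW, 5 training samples
ood_gain = relative_improvement(21.37, 14.22)       # Lausanne, out-of-distribution

print(f"few-shot gain over nnU-Net: {few_shot_gain:.1f}%")
print(f"OOD gain over nnU-Net: {ood_gain:.1f}%")
```

Both values round to the 30% and 50% figures stated in the abstract. The gains are relative, so the absolute differences are smaller: about 10 Dice points in the few-shot setting and about 7 points out of distribution.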

Paper Structure (12 sections, 6 equations, 3 figures, 2 tables)

Figures (3)

  • Figure 1: Overview of the proposed framework. (a) The overall architecture leverages a frozen 2D DINOv3 backbone to extract semantic features, while a parallel Lightweight 3D Adapter captures high-frequency volumetric details. (b) The 3D Aggregator fuses these multi-scale features via factorized slice and global spatial attention. (c) The decoupled convolution block within the 3D Adapter efficiently models spatial and inter-slice dependencies.
  • Figure 2: Label efficiency analysis on the TopCoW (ID) dataset. We plot the Dice score (test set) against the number of training samples (log scale). Our method (red) demonstrates superior performance in the few-shot regime ($N \le 9$), achieving a 30.0% relative improvement over nnU-Net at 5 samples. Furthermore, our method consistently outperforms all other baselines (SwinUNETR, UNETR, 3D U-Net) by a substantial margin across all data regimes, as these models fail to generalize even with full supervision ($<40\%$ Dice). While nnU-Net recovers capacity in the high-data regime, our approach remains the most robust solution under label scarcity.
  • Figure 3: Qualitative comparison on the unseen OOD Lausanne dataset (trained on 5 TopCoW subjects). Samples were randomly selected to ensure unbiased evaluation. Top Row: A 2D slice where nnU-Net fails due to domain shift, while our method correctly delineates the vessel. Bottom Row: 3D rendering of a different subject; the baseline produces fragmented results (blue circle), while our method preserves vascular connectivity (green circle).
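The Figure 1(c) caption describes a decoupled convolution block that models spatial and inter-slice dependencies separately. The captions do not give the kernel layout, so the sketch below assumes one common factorization: an in-plane 1×k×k convolution followed by an inter-slice k×1×1 convolution. The channel width of 64, the kernel size of 3, and the function names are hypothetical and chosen only to show why such a factorization is lighter than a full k×k×k 3D convolution.

```python
def conv3d_params(c_in, c_out, kernel):
    """Parameter count of a 3D convolution: weights plus one bias per output channel."""
    kd, kh, kw = kernel
    return c_in * c_out * kd * kh * kw + c_out


def decoupled_params(c_in, c_out, k):
    """Parameter count of an assumed decoupled block.

    Assumption: a (1, k, k) in-plane convolution followed by a (k, 1, 1)
    inter-slice convolution. The paper's actual block may differ.
    """
    in_plane = conv3d_params(c_in, c_out, (1, k, k))      # spatial, within a slice
    inter_slice = conv3d_params(c_out, c_out, (k, 1, 1))  # across neighboring slices
    return in_plane + inter_slice


# Hypothetical sizes, not taken from the paper.
channels, k = 64, 3
full = conv3d_params(channels, channels, (k, k, k))
decoupled = decoupled_params(channels, channels, k)

print(f"full 3D conv: {full} parameters")
print(f"decoupled block: {decoupled} parameters")
print(f"ratio: {decoupled / full:.2f}")
```

Under these assumptions the decoupled block needs fewer than half the parameters of the full 3D convolution, which is consistent with the adapter being described as lightweight.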