Challenging DINOv3 Foundation Model under Low Inter-Class Variability: A Case Study on Fetal Brain Ultrasound

Edoardo Conti, Riccardo Rosati, Lorenzo Federici, Adriano Mancini, Maria Chiara Fiorentin

TL;DR

This work tackles the challenge of discriminating closely related fetal brain planes in ultrasound when inter-class variability is low. It introduces the FetalUS-188K multicenter benchmark and evaluates a DINOv3-based self-supervised pretraining approach, comparing domain-adaptive pretraining on fetal US data against initialization from natural-image weights for classifying the transthalamic (TT), transventricular (TV), and transcerebellar (TC) planes. The study shows that domain-specific pretraining improves weighted F1 by up to 20% and preserves the subtle echogenic cues needed for fine-grained discrimination, whereas generic foundation models fail to generalize in this setting. These findings imply that deploying foundation models in clinical fetal ultrasound requires domain-tailored pretraining and carefully designed transfer strategies to ensure robust, clinically reliable plane identification.

Abstract

Purpose: This study provides the first comprehensive evaluation of foundation models in fetal ultrasound (US) imaging under low inter-class variability conditions. While recent vision foundation models such as DINOv3 have shown remarkable transferability across medical domains, their ability to discriminate anatomically similar structures has not been systematically investigated. We address this gap by focusing on fetal brain standard planes, namely transthalamic (TT), transventricular (TV), and transcerebellar (TC), which exhibit highly overlapping anatomical features and pose a critical challenge for reliable biometric assessment.

Methods: To ensure a fair and reproducible evaluation, all publicly available fetal ultrasound datasets were curated and aggregated into a unified multicenter benchmark, FetalUS-188K, comprising more than 188,000 annotated images from heterogeneous acquisition settings. DINOv3 was pretrained in a self-supervised manner to learn ultrasound-aware representations. The learned features were then evaluated through standardized adaptation protocols, including linear probing with a frozen backbone and full fine-tuning, under two initialization schemes: (i) pretraining on FetalUS-188K and (ii) initialization from natural-image DINOv3 weights.

Results: Models pretrained on fetal ultrasound data consistently outperformed those initialized on natural images, with weighted F1-score improvements of up to 20 percent. Domain-adaptive pretraining enabled the network to preserve subtle echogenic and structural cues crucial for distinguishing intermediate planes such as TV.

Conclusion: Results demonstrate that generic foundation models fail to generalize under low inter-class variability, whereas domain-specific pretraining is essential to achieve robust and clinically reliable representations in fetal brain ultrasound imaging.
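The headline metric in the results is the weighted F1-score. As a reminder of what it measures, the sketch below computes it as the support-weighted mean of per-class F1 scores, which is the usual definition; it is not taken from the authors' code. The function name `weighted_f1` and the toy labels are assumptions made for illustration and are not results from the paper.

```python
from collections import Counter


def weighted_f1(y_true, y_pred, labels):
    """Support-weighted mean of per-class F1 scores."""
    support = Counter(y_true)  # number of true samples per class
    total = len(y_true)
    score = 0.0
    for label in labels:
        pairs = list(zip(y_true, y_pred))
        tp = sum(1 for t, p in pairs if t == label and p == label)
        fp = sum(1 for t, p in pairs if t != label and p == label)
        fn = sum(1 for t, p in pairs if t == label and p != label)
        denom = 2 * tp + fp + fn
        f1 = 2 * tp / denom if denom else 0.0
        # Each class contributes in proportion to its share of true samples.
        score += (support[label] / total) * f1
    return score


# Toy labels, invented for illustration only (hypothetical, not paper data).
y_true = ["TT", "TT", "TV", "TV", "TC", "TC", "TC", "TC"]
y_pred = ["TT", "TV", "TV", "TT", "TC", "TC", "TC", "TC"]
print(weighted_f1(y_true, y_pred, ["TT", "TV", "TC"]))  # 0.75
```

In the toy example, TT and TV are confused with each other (F1 of 0.5 each) while TC is classified perfectly, so the weighted score is 0.25 * 0.5 + 0.25 * 0.5 + 0.5 * 1.0 = 0.75. Because classes are weighted by support, a well-separated majority class can mask confusion between anatomically similar minority classes.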

Paper Structure

This paper contains 9 sections, 1 equation, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Overview of the proposed pipeline. (1) All publicly available fetal ultrasound datasets are aggregated and curated to build the FetalUS-188K dataset, ensuring heterogeneous and diverse ultrasound acquisitions. (2) A self-supervised training strategy based on the DINOv3 pipeline is employed to learn a feature extractor specifically aware of fetal ultrasound characteristics. (3) The learned representations are transferred to the downstream task of fetal brain standard plane classification.
  • Figure 2: Confusion matrices for TT, TV, and TC obtained using linear probing on different ViT architectures. Top row: FetalUS-188K pretrained weights; bottom row: LVD-1689M pretrained weights.
  • Figure 3: Confusion matrices for TT, TV, and TC obtained using full fine-tuning on the ViT-B/16 architecture. Left: FetalUS-188K pretrained weights; right: LVD-1689M pretrained weights.
  • Figure 4: Principal component visualization from FetalUS-188K pretrained DINOv3 features. Each row corresponds to a fetal brain standard plane (TV, TC, TT), and columns show the first three principal components (PC1–PC3) and their RGB composite.
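
A visualization like the one in Figure 4 can be produced roughly as follows. This is a minimal sketch that assumes the figure follows the common practice of applying PCA to per-patch ViT features and mapping the first three components to RGB; the exact procedure is not stated here. The function `pca_rgb`, the 14 x 14 patch grid, the 768-dimensional features, and the random input are all hypothetical stand-ins.

```python
import numpy as np


def pca_rgb(features, grid_hw):
    """Project patch features onto their first three principal components
    and rescale each component to [0, 1] for display as an RGB image."""
    h, w = grid_hw
    centered = features - features.mean(axis=0, keepdims=True)
    # Rows of vt are principal directions, ordered by explained variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    pcs = centered @ vt[:3].T  # shape: (num_patches, 3)
    lo, hi = pcs.min(axis=0), pcs.max(axis=0)
    rgb = (pcs - lo) / np.maximum(hi - lo, 1e-12)  # per-channel min-max
    return rgb.reshape(h, w, 3)


# Random stand-in for patch embeddings (assumption: 14x14 grid, 768 dims).
rng = np.random.default_rng(0)
patch_features = rng.normal(size=(14 * 14, 768))
composite = pca_rgb(patch_features, (14, 14))
print(composite.shape)  # (14, 14, 3)
```

Each channel of `composite` corresponds to one of PC1 to PC3, so the individual columns of the figure would be the single channels and the last column the stacked RGB image.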