FetalCLIP: A Visual-Language Foundation Model for Fetal Ultrasound Image Analysis

Fadillah Maani, Numan Saeed, Tausifa Saleem, Zaid Farooq, Hussain Alasmawi, Werner Diehl, Ameera Mohammad, Gareth Waring, Saudabi Valappi, Leanne Bricker, Mohammad Yaqub

TL;DR

FetalCLIP introduces a domain-specific visual-language foundation model for fetal ultrasound analysis, addressing the limitations of general medical visual-language models by training on a large paired dataset of fetal images and captions. It leverages a CLIP-style objective with a ViT-L image encoder and a 12-layer text transformer to produce shared embeddings, enabling robust zero-shot classification of fetal views, zero-shot gestational age estimation, and effective downstream transfer to CHD detection and segmentation. Across extensive benchmarks, FetalCLIP achieves state-of-the-art zero-shot performance (e.g., average F1 for view classification of 87.1%), improved CHD detection AUROC, and high Dice scores for segmentation, while maintaining interpretability through CAM and UMAP analyses. The work demonstrates strong generalization and data efficiency, highlights the value of domain-specific multimodal pretraining, and provides open access to code and pretrained weights to accelerate further research and clinical deployment.
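
As a rough illustration of the CLIP-style objective mentioned above, the following sketch computes a symmetric contrastive loss over a small batch of paired image and caption embeddings. It is not the FetalCLIP training code: the function names, the toy two-dimensional embeddings, and the temperature of 0.07 are assumptions chosen for illustration, and a real setup would use encoder outputs and a deep learning framework.

```python
import math


def l2_normalize(vec):
    """Scale a vector to unit length so dot products become cosine similarities."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def log_softmax(row):
    """Numerically stable log-softmax over a list of logits."""
    m = max(row)
    log_z = m + math.log(sum(math.exp(x - m) for x in row))
    return [x - log_z for x in row]


def clip_style_loss(image_embs, text_embs, temperature=0.07):
    """Symmetric contrastive loss; row i of each input is assumed to be a matched pair.

    The temperature value is an illustrative assumption, not taken from the paper.
    """
    imgs = [l2_normalize(v) for v in image_embs]
    txts = [l2_normalize(v) for v in text_embs]
    n = len(imgs)
    # logits[i][j]: scaled cosine similarity between image i and caption j.
    logits = [
        [sum(a * b for a, b in zip(imgs[i], txts[j])) / temperature for j in range(n)]
        for i in range(n)
    ]
    # Image-to-text direction: each image should rank its own caption highest.
    i2t = -sum(log_softmax(logits[i])[i] for i in range(n)) / n
    # Text-to-image direction: each caption should rank its own image highest.
    columns = [[logits[i][j] for i in range(n)] for j in range(n)]
    t2i = -sum(log_softmax(columns[j])[j] for j in range(n)) / n
    return 0.5 * (i2t + t2i)


# Toy embeddings (hypothetical): matched pairs point in the same direction.
aligned = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[1.0, 0.0], [0.0, 1.0]])
# Same embeddings with captions swapped, so every pair is mismatched.
shuffled = clip_style_loss([[1.0, 0.0], [0.0, 1.0]], [[0.0, 1.0], [1.0, 0.0]])
```

In this toy case the loss for the aligned batch is close to zero and the loss for the swapped batch is large, which is the behavior the objective rewards: matched image-caption pairs are pulled together and unmatched pairs are pushed apart.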

Abstract

Foundation models are becoming increasingly effective in the medical domain, offering pre-trained models on large datasets that can be readily adapted for downstream tasks. Despite progress, fetal ultrasound images remain a challenging domain for foundation models due to their inherent complexity, often requiring substantial additional training and facing limitations due to the scarcity of paired multimodal data. To overcome these challenges, here we introduce FetalCLIP, a vision-language foundation model capable of generating universal representations of fetal ultrasound images. FetalCLIP was pre-trained using a multimodal learning approach on a diverse dataset of 210,035 fetal ultrasound images paired with text. This represents the largest paired dataset of its kind used for foundation model development to date. This training approach allows FetalCLIP to effectively learn the intricate anatomical features present in fetal ultrasound images, resulting in robust representations that can be used for a variety of downstream applications. In extensive benchmarking across a range of key fetal ultrasound applications, including classification, gestational age estimation, congenital heart defect (CHD) detection, and fetal structure segmentation, FetalCLIP outperformed all baselines while demonstrating remarkable generalizability and strong performance even with limited labeled data. We plan to release the FetalCLIP model publicly for the benefit of the broader scientific community.

Paper Structure

This paper contains 26 sections, 1 equation, 10 figures, 1 table.

Figures (10)

  • Figure 1: Overview of FetalCLIP development and performance. a, Dataset curation of fetal ultrasound image-caption pairs used for FetalCLIP pretraining. The pretraining data was curated from two sources: (1) routine pregnancy ultrasound scans, comprising 207,943 images with corresponding LLM-generated pseudo captions, which incorporate clinicians' labels, gestational age, and pixel spacing; and (2) 2,092 image-caption pairs derived from a fetal ultrasound textbook. b, FetalCLIP pretraining step through contrastive learning, maximizing similarity between paired images and captions while minimizing similarity between unpaired ones. c, Schematic diagram illustrating FetalCLIP's capability and radar plot demonstrating FetalCLIP's superior performance over existing vision-language foundation models across diverse fetal ultrasound tasks, including fetal plane classification, congenital heart disease (CHD) detection, and fetal structure segmentation on different views. d, Distribution of routine pregnancy ultrasound scan data, which constitutes the largest portion of the FetalCLIP pretraining data.
  • Figure 1: Confusion matrices for the classification of five standard fetal planes and three brain subviews. a-c, Confusion matrices of FetalCLIP, SonoNet, and UniMed-CLIP, respectively.
  • Figure 2: Zero-shot capabilities of FetalCLIP. a, Illustration of zero-shot fetal plane classification. We leveraged an LLM to generate prompts for a set of predefined candidate planes (detailed in the Extended Data figure on inference prompts). The predicted plane was determined by identifying the highest similarity between the image embedding and prompt embeddings. b, Zero-shot performance in distinguishing five standard fetal planes and three brain subplanes. FetalCLIP achieved the highest accuracy with an average F1 score of 87.1%, outperforming the specialist model SonoNet by 17.2%. c, Illustration of zero-shot GA estimation. A similarity map was computed between the image embeddings and prompt embeddings spanning 14 to 40 weeks of GA. We then postprocessed the similarity map to predict GA. d, GA estimation performance of visual-language foundation models. The blue points represent valid predictions, while the red points indicate invalid predictions. The black line represents the 50th percentile of the quantile regression population, and the orange lines represent the 2.5th and 97.5th percentiles of the population as provided by the WHO fetalcalculator. Unlike FetalCLIP, other models demonstrated no ability to infer GA from fetal ultrasound head images.
  • Figure 2: Examples of various image views from the private hospital dataset. a, Representative examples of standard views from the fetal ultrasound dataset, showcasing diverse anatomical planes such as 4CH, Femur, Kidney, and Transcerebellum. b, Examples of mislabeled samples detected by Confident Learning. c, Ultrasound images containing multiple clinician labels.
  • Figure 3: Linear probing for classification tasks. a, Schematic of linear probing for classifying different fetal planes. The image encoder of a visual-language foundation model was used to extract image embeddings, followed by a trainable linear layer for classification. b-c, F1 scores on the test set for fetal plane and brain subplane classification, from 5-fold cross-validation with five different seeds. The bars represent the mean F1 scores, while the error bars indicate the standard deviation. d, Illustration of linear probing for CHD detection from an ultrasound clip. Embeddings were extracted from each image in the clip and concatenated. A trainable linear layer was then applied to the combined embeddings for classification. e, AUROC comparisons for CHD detection across 5-fold cross-validation with five different seeds. f, ROC curve for CHD prediction showing the median performance of each model.
  • ...and 5 more figures
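
The zero-shot plane classification described in the Figure 2a caption (choosing the candidate plane whose prompt embeddings are most similar to the image embedding) can be sketched as follows. This is a minimal illustration, not the authors' implementation: the three-dimensional vectors stand in for real encoder outputs, the class names and values are made up, and averaging similarities over the prompts of each class is an assumption, since the caption does not say how multiple prompts per plane are combined.

```python
import math


def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


def zero_shot_predict(image_emb, prompt_embs_by_class):
    """Return the class with the highest mean prompt similarity, plus all scores.

    Averaging over prompts is an illustrative assumption.
    """
    scores = {}
    for label, prompt_embs in prompt_embs_by_class.items():
        sims = [cosine(image_emb, p) for p in prompt_embs]
        scores[label] = sum(sims) / len(sims)
    best = max(scores, key=scores.get)
    return best, scores


# Hypothetical prompt embeddings: two prompts for each candidate plane.
toy_prompts = {
    "abdomen": [[1.0, 0.1, 0.0], [0.9, 0.0, 0.1]],
    "femur": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.0]],
    "brain": [[0.0, 0.1, 1.0], [0.1, 0.0, 0.9]],
}
# Hypothetical image embedding that lies closest to the "brain" prompts.
toy_image = [0.1, 0.2, 0.95]

label, scores = zero_shot_predict(toy_image, toy_prompts)
print(label)
```

Because no classifier is trained, adding a new candidate plane only requires adding its prompts to the dictionary, which is what makes the approach zero-shot.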