OpenUS: A Fully Open-Source Foundation Model for Ultrasound Image Analysis via Self-Adaptive Masked Contrastive Learning

Xiaoyu Zheng, Xu Chen, Awais Rauf, Qifan Fu, Benedetta Monosi, Felice Rivellese, Myles J. Lewis, Shaogang Gong, Gregory Slabaugh

TL;DR

OpenUS tackles the challenge of building a generalizable, label-efficient foundation model for ultrasound by integrating a Vision Mamba backbone with a novel self-adaptive masking framework and global-local masked contrastive learning. Trained on 308K public US images from 42 datasets, it achieves superior segmentation and competitive classification across multiple downstream tasks, while maintaining strong label efficiency. The approach combines self-distillation masked image modeling with an adaptive mask generation strategy that fuses teacher attention and student difficulty, enabling robust learning over ultrasound-specific artifacts like speckle. Overall, OpenUS demonstrates the feasibility and value of fully open-source ultrasound foundation models for scalable, reproducible AI-assisted ultrasound research and clinical workflows.

Abstract

Ultrasound (US) is one of the most widely used medical imaging modalities, thanks to its low cost, portability, real-time feedback, and absence of ionizing radiation. However, US image interpretation remains highly operator-dependent and varies significantly across anatomical regions, acquisition protocols, and device types. These variations, along with unique challenges such as speckle, low contrast, and limited standardized annotations, hinder the development of generalizable, label-efficient ultrasound AI models. In this paper, we propose OpenUS, the first reproducible, open-source ultrasound foundation model built on a large collection of public data. OpenUS employs a Vision Mamba backbone, capturing both local and global long-range dependencies across the image. To extract rich features during pre-training, we introduce a novel self-adaptive masking framework that combines contrastive learning with masked image modeling. This strategy integrates the teacher's attention map with the student's reconstruction loss, adaptively steering masking toward clinically relevant regions to enhance pre-training effectiveness. OpenUS also applies a dynamic learning schedule to progressively adjust the difficulty of the pre-training process. To develop the foundation model, we compile the largest public ultrasound dataset to date, comprising over 308K images from 42 publicly available datasets and covering diverse anatomical regions, institutions, imaging devices, and disease types. Our pre-trained OpenUS model can be easily adapted to specific downstream tasks by serving as a backbone for label-efficient fine-tuning. Code is available at https://github.com/XZheng0427/OpenUS.
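
To make the masking idea concrete, the following is a minimal sketch of how a teacher attention map and a per-patch student reconstruction loss could be fused into a single masking decision. It is an illustration under stated assumptions, not the OpenUS implementation: the function name `self_adaptive_mask`, the min-max normalization, the linear weight `alpha`, and the split `guided_frac` between score-guided and random patches are all hypothetical choices made here, and the exact formulation used by the authors is not given in this section.

```python
import numpy as np


def minmax(x):
    """Rescale a 1-D score vector to [0, 1]; a constant vector maps to zeros."""
    span = x.max() - x.min()
    return (x - x.min()) / span if span > 0 else np.zeros_like(x, dtype=float)


def self_adaptive_mask(teacher_attn, student_loss, mask_ratio=0.5,
                       alpha=0.5, guided_frac=0.5, rng=None):
    """Pick patches to mask from fused teacher/student signals.

    teacher_attn : (N,) attention weight per patch from the teacher (assumed input)
    student_loss : (N,) reconstruction loss per patch from the student (assumed input)
    mask_ratio   : fraction of the N patches to mask in total
    alpha        : hypothetical weight between attention and difficulty
    guided_frac  : hypothetical share of masked patches chosen by score;
                   the remainder is sampled uniformly at random
    """
    rng = np.random.default_rng() if rng is None else rng
    n = teacher_attn.shape[0]
    n_mask = int(round(mask_ratio * n))
    n_guided = int(round(guided_frac * n_mask))

    # Fuse the two signals after putting them on a common [0, 1] scale.
    score = alpha * minmax(teacher_attn) + (1.0 - alpha) * minmax(student_loss)

    # Mask the highest-scoring patches first.
    mask = np.zeros(n, dtype=bool)
    mask[np.argsort(-score, kind="stable")[:n_guided]] = True

    # Fill the rest of the masking budget with random, not-yet-masked patches.
    remaining = np.flatnonzero(~mask)
    extra = rng.choice(remaining, size=n_mask - n_guided, replace=False)
    mask[extra] = True
    return mask, score


# Toy usage with random signals for a 4x4 patch grid (16 patches).
rng = np.random.default_rng(0)
attn = rng.random(16)
loss = rng.random(16)
mask, score = self_adaptive_mask(attn, loss, rng=rng)
```

In this sketch, raising `mask_ratio` over the course of training would be one simple way to make the pretext task progressively harder, in the spirit of the dynamic schedule mentioned in the abstract; the actual schedule used by OpenUS is not specified here.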

Paper Structure

This paper contains 37 sections, 7 equations, 10 figures, 8 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of Universal US Foundation Model.
  • Figure 1: Quantitative results on classification tasks.
  • Figure 2: (a) The OpenUS pipeline, including the Teacher and Student models, masking approaches, and reconstruction heads. For global and local views, we design two distinct masking strategies: (b) self-adaptive masking and (c) random block-wise masking. Both are integrated with masked image reconstruction and contrastive learning.
  • Figure 3: Visual comparison of segmentation ground truths with attention-only, reconstruction-loss, and self-adaptive $ALP$ scores. In the last column, the red areas tend to overlap with clinically relevant regions, while the dark grey regions represent the remaining randomly masked areas.
  • Figure 4: Visualization of US segmentation results on TN3K and BUS-BRA. The ground truth is depicted in green, and the prediction is shown in yellow.
  • ...and 5 more figures