Adapting Vision Foundation Models for Real-time Ultrasound Image Segmentation

Xiaoran Zhang; Eric Z. Chen; Lin Zhao; Xiao Chen; Yikang Liu; Boris Maihe; James S. Duncan; Terrence Chen; Shanhui Sun

TL;DR

Ultrasound image segmentation remains challenging because of speckle noise, low SNR, and high anatomical variability, and existing supervised methods generalize poorly when annotations are limited. The authors adapt the hierarchical vision foundation model Hiera with a lightweight adapter and interleave its multi-scale features with DINOv2 semantic features; a hierarchical decoder then produces pixel-wise segmentations. Evaluated on seven datasets covering cardiac and thyroid ultrasound (six public, one in-house), the approach outperforms prior state-of-the-art methods on multiple datasets in region-overlap metrics and is markedly data-efficient, surpassing nnUNet by over 20% on average when trained with 1% or 10% of the data, while running at about 77 FPS with TensorRT on a single GPU. The work offers a practical route to deploying high-quality ultrasound segmentation in real-time clinical workflows, with potential extensions to video and 3D imaging.

Abstract

We propose a novel approach that adapts hierarchical vision foundation models for real-time ultrasound image segmentation. Existing ultrasound segmentation methods often struggle with adaptability to new tasks, relying on costly manual annotations, while real-time approaches generally fail to match state-of-the-art performance. To overcome these limitations, we introduce an adaptive framework that leverages the vision foundation model Hiera to extract multi-scale features, interleaved with DINOv2 representations to enhance visual expressiveness. These enriched features are then decoded to produce precise and robust segmentation. We conduct extensive evaluations on six public datasets and one in-house dataset, covering both cardiac and thyroid ultrasound segmentation. Experiments show that our approach outperforms state-of-the-art methods across multiple datasets and excels with limited supervision, surpassing nnUNet by over 20% on average in the 1% and 10% data settings. Our method achieves ~77 FPS inference speed with TensorRT on a single GPU, enabling real-time clinical applications.
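
For readers unfamiliar with the quantities mentioned above, the following is a minimal illustrative sketch, not the authors' evaluation code. It assumes the region-overlap metric is the Dice coefficient (the text here does not name the metric) and shows how a frame rate such as the reported ~77 FPS translates into a per-frame latency budget. The function names `dice_score` and `frame_budget_ms` are hypothetical.

```python
def dice_score(pred, target):
    """Dice coefficient between two binary masks given as nested lists.

    Assumption: Dice stands in for the unspecified region-overlap metric.
    Returns a value in [0, 1], where 1.0 means perfect overlap.
    """
    # Flatten both masks and coerce every pixel to a boolean.
    p = [bool(v) for row in pred for v in row]
    t = [bool(v) for row in target for v in row]
    if len(p) != len(t):
        raise ValueError("masks must contain the same number of pixels")

    intersection = sum(1 for a, b in zip(p, t) if a and b)
    total = sum(p) + sum(t)
    if total == 0:
        # Both masks are empty; treating this as perfect agreement is a convention.
        return 1.0
    return 2.0 * intersection / total


def frame_budget_ms(fps):
    """Per-frame time budget in milliseconds for a given frame rate."""
    return 1000.0 / fps
```

Under these assumptions, `frame_budget_ms(77)` gives roughly 13 ms per frame, which is the time available for the full inference pass if the stated throughput is to be sustained.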
