Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis

Jingguo Qu; Xinyang Han; Jia Ai; Juan Wu; Tong Zhao; Tonghuan Xiao; Sheng Ning; Yuqi Yang; Jing Qin; Ann Dorothy King; Winnie Chiu-Wing Chu; Jing Cai; Michael Tin-Cheung Ying

TL;DR

The paper addresses the domain shift that arises when vision-language foundation models are applied to ultrasound imaging. It proposes Hybrid-tuning (HT), a parameter-efficient strategy that integrates a lightweight adapter into a frozen CLIP visual backbone. The adapter combines frequency-domain filtering with dynamic noise estimation and is paired with lightweight segmentation and classification heads that use multi-scale feature aggregation. In experiments on six multi-center ultrasound datasets, HT-based models outperform state-of-the-art baselines, including BiomedCLIP and standard LoRA fine-tuning, and show strong data efficiency and robustness, even in few-shot scenarios. The work also suggests further cross-modal alignment and test-time adaptation as directions for narrowing the gap between natural and sonographic images.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities, yet their application to medical ultrasound remains constrained by the significant domain shift between natural images and sonographic data. The unique physics of ultrasound, manifesting as speckle noise, shadowing, and variable artifacts, often leads to suboptimal performance when applying off-the-shelf foundation models. To address this, we propose a novel Hybrid-tuning (HT) strategy for the efficient adaptation of CLIP-based models to ultrasound analysis. Our method introduces a lightweight adapter module integrated into the frozen visual backbone, featuring frequency-domain filtering to suppress periodic artifacts and dynamic noise estimation to calibrate feature representations. Furthermore, we design specialized segmentation and classification heads that employ multi-scale feature aggregation to maximize the utility of pre-trained semantic priors. Extensive evaluations across six multi-center datasets (covering lymph nodes, breast, thyroid, and prostate) reveal that our HT-enhanced models significantly outperform existing state-of-the-art methods, including BiomedCLIP and standard LoRA fine-tuning. The results highlight the superior data efficiency and robustness of our approach, paving the way for practical, foundational intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.
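To make the adapter idea concrete, the following is a minimal NumPy sketch of one way a frequency-filtered, noise-gated residual adapter could be arranged. It is not the authors' implementation (see the linked repository for that). The function names, the rectangular low-pass mask, the `keep_ratio` parameter, the residual-based noise estimate, and the zero-initialized up-projection are all assumptions chosen for illustration; the abstract does not specify any of them.

```python
import numpy as np


def frequency_filter(features, keep_ratio=0.5):
    """Low-pass filter a (H, W, C) feature map in the 2-D Fourier domain.

    Assumption: a centered rectangular mask; the paper's actual filter may differ.
    """
    h, w, _ = features.shape
    spectrum = np.fft.fftshift(np.fft.fft2(features, axes=(0, 1)), axes=(0, 1))
    # Keep only the lowest spatial frequencies around the (shifted) DC term.
    kh = max(1, int(h * keep_ratio / 2))
    kw = max(1, int(w * keep_ratio / 2))
    mask = np.zeros((h, w, 1))
    mask[h // 2 - kh : h // 2 + kh + 1, w // 2 - kw : w // 2 + kw + 1] = 1.0
    filtered = np.fft.ifft2(
        np.fft.ifftshift(spectrum * mask, axes=(0, 1)), axes=(0, 1)
    )
    return filtered.real


def noise_gate(features, filtered):
    """Per-channel gate in [0, 1] from a crude noise estimate.

    Assumption: noise is approximated by the std of the high-frequency residual.
    """
    noise = (features - filtered).std(axis=(0, 1), keepdims=True)
    signal = filtered.std(axis=(0, 1), keepdims=True)
    return signal / (signal + noise + 1e-8)


def adapter(features, w_down, w_up, keep_ratio=0.5):
    """Bottleneck adapter added residually to frozen backbone features."""
    filtered = frequency_filter(features, keep_ratio)
    gate = noise_gate(features, filtered)
    hidden = np.maximum((filtered * gate) @ w_down, 0.0)  # ReLU bottleneck
    return features + hidden @ w_up  # backbone features stay untouched


rng = np.random.default_rng(0)
tokens = rng.normal(size=(8, 8, 4))   # hypothetical 8x8 patch grid, 4 channels
w_down = rng.normal(size=(4, 2))      # trainable down-projection
w_up = np.zeros((2, 4))               # zero init: adapter starts as identity
adapted = adapter(tokens, w_down, w_up)
```

With the up-projection initialized to zero, the adapter initially returns the frozen backbone's features unchanged, a common choice for parameter-efficient tuning so that training starts from the pre-trained behavior.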
