Adapting Vision-Language Foundation Model for Next Generation Medical Ultrasound Image Analysis

Jingguo Qu, Xinyang Han, Jia Ai, Juan Wu, Tong Zhao, Tonghuan Xiao, Sheng Ning, Yuqi Yang, Jing Qin, Ann Dorothy King, Winnie Chiu-Wing Chu, Jing Cai, Michael Tin-Cheung Ying

TL;DR

The paper addresses the domain shift that arises when vision-language foundation models are applied to ultrasound imaging and proposes Hybrid-tuning (HT), a parameter-efficient adapter integrated into a frozen CLIP backbone. The adapter combines frequency-domain filtering with dynamic noise estimation, and lightweight segmentation and classification heads use multi-scale feature aggregation. In experiments on six multi-center ultrasound datasets, HT-based models outperform state-of-the-art baselines such as BiomedCLIP and LoRA and remain data-efficient and robust in few-shot scenarios. The work also suggests further cross-modal alignment and test-time adaptation as directions for narrowing the gap between natural and sonographic vision.

Abstract

Vision-Language Models (VLMs) have demonstrated remarkable generalization capabilities, yet their application to medical ultrasound remains constrained by the significant domain shift between natural images and sonographic data. The unique physics of ultrasound, manifesting as speckle noise, shadowing, and variable artifacts, often leads to suboptimal performance when applying off-the-shelf foundation models. To address this, we propose a novel Hybrid-tuning (HT) strategy for the efficient adaptation of CLIP-based models to ultrasound analysis. Our method introduces a lightweight adapter module integrated into the frozen visual backbone, featuring frequency-domain filtering to suppress periodic artifacts and dynamic noise estimation to calibrate feature representations. Furthermore, we design specialized segmentation and classification heads that employ multi-scale feature aggregation to maximize the utility of pre-trained semantic priors. Extensive evaluations across six multi-center datasets (covering lymph nodes, breast, thyroid, and prostate) reveal that our HT-enhanced models significantly outperform existing state-of-the-art methods, including BiomedCLIP and standard LoRA fine-tuning. The results highlight the superior data efficiency and robustness of our approach, paving the way for practical, foundational intelligence in automated ultrasound diagnosis. The source code is available at https://github.com/jinggqu/NextGen-UIA.
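The abstract names two ingredients of the adapter, frequency-domain filtering and dynamic noise estimation, without giving their exact form on this page. The sketch below is only a toy illustration of how such a pair could interact on a 1-D feature sequence; it is not the authors' implementation. The function names (`low_pass`, `adapter`), the `keep` cutoff, and the gating formula are all assumptions made for illustration.

```python
import cmath
import math


def dft(signal):
    """Naive O(n^2) discrete Fourier transform of a real or complex sequence."""
    n = len(signal)
    return [
        sum(signal[t] * cmath.exp(-2j * math.pi * k * t / n) for t in range(n))
        for k in range(n)
    ]


def idft(spectrum):
    """Inverse of dft(); returns complex samples."""
    n = len(spectrum)
    return [
        sum(spectrum[k] * cmath.exp(2j * math.pi * k * t / n) for k in range(n)) / n
        for t in range(n)
    ]


def low_pass(signal, keep):
    """Zero every frequency bin whose distance from DC exceeds `keep` (hypothetical cutoff)."""
    n = len(signal)
    spectrum = dft(signal)
    filtered = [
        value if min(k, n - k) <= keep else 0.0
        for k, value in enumerate(spectrum)
    ]
    return [sample.real for sample in idft(filtered)]


def adapter(features, keep=2):
    """Hypothetical adapter: blend the frozen features with a filtered branch.

    The blend weight is driven by a noise estimate, so clean inputs pass
    through almost unchanged and noisy inputs lean on the filtered branch.
    """
    smooth = low_pass(features, keep)
    removed = [f - s for f, s in zip(features, smooth)]
    # Assumed noise estimate: RMS of whatever the filter removed.
    noise = math.sqrt(sum(r * r for r in removed) / len(removed))
    gate = noise / (1.0 + noise)  # lies in [0, 1)
    return [f + gate * (s - f) for f, s in zip(features, smooth)]


n = 16
slow = [math.cos(2 * math.pi * t / n) for t in range(n)]                 # content
artifact = [0.5 * math.cos(2 * math.pi * 6 * t / n) for t in range(n)]  # periodic artifact
noisy = [a + b for a, b in zip(slow, artifact)]
adapted = adapter(noisy)
```

In this toy setting the low-pass branch removes the periodic component entirely, while the gated output only attenuates it; with a clean input the gate is close to zero and the adapter behaves like an identity mapping, which mirrors the general idea of leaving a frozen backbone's features largely intact.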

Paper Structure

This paper contains 16 sections, 5 equations, 5 figures, 7 tables.

Figures (5)

  • Figure 1: Overview of the proposed workflow. (a) Fine-tuning stage: a trainable HT adapter is introduced into the frozen CLIP model to bridge the domain gap between natural images and radiological scans. (b) Downstream tasks: trainable heads are applied for supervised ultrasound image segmentation and classification (solid arrows), and the zero-shot ultrasound diagnosis capability of CLIP is assessed using ensembled prompt-image pairs (dashed arrows).
  • Figure 2: Structural overview of the original LoRA, Mona, and the Mona-based HT adapter for the CLIP vision encoder (ViT).
  • Figure 3: Architecture of Hybrid-tuning module.
  • Figure 4: Segmentation visualization on the LN-INT, LN-EXT, BUSI [busi_al_2020], DDTI [ddti_pedraza_2015], TN3K [tn3k_gong_2023], and Prostate [microsegnet_jiang_2023] datasets. The first column shows the input images. Regions in red, green and yellow indicate the ground truth, false positive and true positive, respectively. Hybrid-tuned models are in bold.
  • Figure 5: Few-shot segmentation visualization on LN-INT dataset. The first column shows the input images. The percentages represent the amount of data used for model training. Regions in red, green and yellow indicate the ground truth, false positive and true positive, respectively.
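
The caption of Figure 1 mentions zero-shot diagnosis with ensembled prompt-image pairs. A common way to ensemble prompts in CLIP-style models is to average the normalized text embeddings of several prompts per class and compare the result with the image embedding by cosine similarity. The sketch below assumes that scheme; the class labels and the three-dimensional toy embeddings are hypothetical stand-ins for real encoder outputs, and the paper's exact procedure may differ.

```python
import math


def normalize(vec):
    """Scale a vector to unit L2 norm."""
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec]


def ensemble_class_embedding(prompt_embeddings):
    """Average the unit-normalized prompt embeddings of one class, then renormalize."""
    unit = [normalize(e) for e in prompt_embeddings]
    dim = len(unit[0])
    mean = [sum(e[i] for e in unit) / len(unit) for i in range(dim)]
    return normalize(mean)


def zero_shot_predict(image_embedding, class_prompt_embeddings):
    """Return (best_label, scores) using cosine similarity to each ensembled class embedding."""
    image = normalize(image_embedding)
    scores = {
        label: sum(a * b for a, b in zip(image, ensemble_class_embedding(embs)))
        for label, embs in class_prompt_embeddings.items()
    }
    return max(scores, key=scores.get), scores


# Hypothetical toy embeddings; a real pipeline would take these from the
# text and image encoders of the CLIP model.
prompts = {
    "benign": [[1.0, 0.1, 0.0], [0.9, 0.2, 0.1]],
    "malignant": [[0.0, 1.0, 0.1], [0.1, 0.9, 0.2]],
}
label, scores = zero_shot_predict([0.2, 1.0, 0.1], prompts)
```

Averaging several phrasings per class tends to reduce sensitivity to any single prompt wording, which is why ensembling is often preferred over a single template in zero-shot evaluation.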