Ultrasound-CLIP: Semantic-Aware Contrastive Pre-training for Ultrasound Image-Text Understanding

Jiayun Jin, Haolong Chai, Xueying Huang, Xiaoqing Guo, Zengwei Zheng, Zhan Zhou, Junmei Wang, Xinyu Wang, Jie Liu, Binbin Zhou

Abstract

Ultrasound imaging is widely used in clinical diagnostics due to its real-time capability and radiation-free nature. However, existing vision-language pre-training models such as CLIP are designed primarily for natural images and are difficult to apply directly to ultrasound data, which exhibit heterogeneous anatomical structures and diverse diagnostic attributes. To bridge this gap, we construct US-365K, a large-scale ultrasound image-text dataset containing 365K paired samples across 52 anatomical categories. We further establish the Ultrasonographic Diagnostic Taxonomy (UDT), which comprises two hierarchical knowledge frameworks. The Ultrasonographic Hierarchical Anatomical Taxonomy (UHAT) standardizes anatomical organization, while the Ultrasonographic Diagnostic Attribute Framework (UDAF) formalizes nine diagnostic dimensions: body system, organ, diagnosis, shape, margins, echogenicity, internal characteristics, posterior acoustic phenomena, and vascularity. Building on these foundations, we propose Ultrasound-CLIP, a semantic-aware contrastive learning framework that introduces semantic soft labels and a semantic loss to refine sample discrimination. Moreover, we construct a heterogeneous graph modality derived from UDAF's textual representations, enabling structured reasoning over lesion-attribute relations. Extensive experiments with patient-level data splitting demonstrate that our approach achieves state-of-the-art performance on classification and retrieval benchmarks, while also generalizing strongly to zero-shot, linear-probing, and fine-tuning settings.
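
To make the semantic soft-label idea concrete, the following is a minimal sketch, not the paper's implementation: it assumes a pairwise UDAF attribute-similarity matrix (`attr_sim`) is available for each batch and that it is blended with the one-hot CLIP targets by a weight `alpha`; both names and the blending scheme are illustrative.

```python
# Minimal sketch: contrastive loss with semantic soft labels.
# Assumptions (hypothetical, not from the paper): `attr_sim` holds pairwise
# UDAF attribute similarity in [0, 1] with 1.0 on the diagonal, and `alpha`
# mixes the hard one-hot targets with the semantic prior.
import torch
import torch.nn.functional as F

def semantic_contrastive_loss(img_emb, txt_emb, attr_sim, temperature=0.07, alpha=0.5):
    """img_emb, txt_emb: (B, D) L2-normalized embeddings; attr_sim: (B, B)."""
    logits = img_emb @ txt_emb.t() / temperature            # (B, B) similarity logits
    hard = torch.eye(logits.size(0), device=logits.device)  # one-hot CLIP targets

    def soften(sim):                                        # row-normalize the prior
        return sim / sim.sum(dim=1, keepdim=True)

    t_i2t = alpha * hard + (1 - alpha) * soften(attr_sim)      # image -> text targets
    t_t2i = alpha * hard + (1 - alpha) * soften(attr_sim.t())  # text -> image targets
    loss_i2t = -(t_i2t * F.log_softmax(logits, dim=1)).sum(dim=1).mean()
    loss_t2i = -(t_t2i * F.log_softmax(logits.t(), dim=1)).sum(dim=1).mean()
    return 0.5 * (loss_i2t + loss_t2i)
```

Blending a row-normalized semantic prior into the targets keeps the true pair dominant while reducing the penalty on negatives that share most UDAF attributes with the anchor, which is one plausible way to "refine sample discrimination" as the abstract describes.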

Paper Structure

This paper contains 44 sections, 10 equations, 13 figures, and 9 tables.

Figures (13)

  • Figure 1: Ultrasound image statistics across major benchmarks. The red segment and its internal percentage show the proportion of ultrasound images, while the blue segment shows the remaining modalities. The top label indicates the absolute number of images (in thousands). Our US-365K is the first large-scale, 100% dedicated ultrasound dataset.
  • Figure 2: Overview of UDT and Ultrasound-CLIP. (a) UDT serves as the semantic foundation, formalizing sonographic knowledge by standardizing anatomical hierarchies (UHAT) and defining 9 key diagnostic attributes (UDAF). (b) Ultrasound-CLIP leverages UDT in two ways: (1) a UDAF-guided heterogeneous graph encoder fuses attribute relationships into the text embedding via cross-attention to model structured reasoning (a minimal sketch of this fusion appears after this figure list); (2) UDAF-based semantic priors are constructed to enable a dual-objective optimization that resolves ambiguity. The framework aligns visual features with these graph-enhanced, semantically aware text representations.
  • Figure 3: Visualization of US-365K's UHAT-based anatomical hierarchy.
  • Figure 4: t-SNE visualization of text embeddings without and with our UDAF-guided graph encoder, with categories colored by diagnosis: fluid collection (blue), mass (purple), normal appearance (red), cyst (green), and nodule (yellow).
  • Figure 5: Visualization of diagnostic reasoning for a left-ankle ultrasound, showing strong clinical coherence.
  • ...and 8 more figures
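
The cross-attention fusion described in Figure 2(b) can be sketched in a few lines. This is an assumed layout under common conventions, not the paper's exact architecture: the module name, dimensions, and the residual-plus-LayerNorm placement are illustrative, and `graph_nodes` is taken to be one feature vector per UDAF attribute node.

```python
# Minimal sketch: text tokens attend over heterogeneous-graph node features.
# All names and hyperparameters here are assumptions for illustration.
import torch
import torch.nn as nn

class GraphTextFusion(nn.Module):
    """Cross-attention block fusing graph features into text embeddings."""
    def __init__(self, dim=512, n_heads=8):
        super().__init__()
        self.attn = nn.MultiheadAttention(dim, n_heads, batch_first=True)
        self.norm = nn.LayerNorm(dim)

    def forward(self, text_tokens, graph_nodes):
        # text_tokens: (B, T, D) text token features, used as queries
        # graph_nodes: (B, N, D) graph node features (e.g. one node per
        # UDAF attribute), used as keys and values
        fused, _ = self.attn(query=text_tokens, key=graph_nodes, value=graph_nodes)
        return self.norm(text_tokens + fused)  # residual add, then LayerNorm

# Usage example: 32 text tokens attending over 9 UDAF attribute nodes.
fusion = GraphTextFusion()
out = fusion(torch.randn(4, 32, 512), torch.randn(4, 9, 512))  # (4, 32, 512)
```

Using the text tokens as queries keeps the output in the text-embedding space, so the fused representation can drop straight into the contrastive alignment with the visual features, consistent with how Figure 2(b) describes the pipeline.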