U2-BENCH: Benchmarking Large Vision-Language Models on Ultrasound Understanding

Anjie Le, Henan Liu, Yue Wang, Zhenyu Liu, Rongkun Zhu, Taohan Weng, Jinze Yu, Boyang Wang, Yalun Wu, Kaiwen Yan, Quanlin Sun, Meirui Jiang, Jialun Pei, Siya Liu, Haoyun Zheng, Zhoujun Li, Alison Noble, Jacques Souquet, Xiaoqing Guo, Manxi Lin, Hongcheng Guo

TL;DR

U2-BENCH introduces the first comprehensive benchmark for evaluating large vision-language models on ultrasound understanding, addressing the modality's noise, variability, and dynamic anatomy. The authors curate 7,241 ultrasound cases across 15 anatomical regions and define an eight-task taxonomy spanning classification, detection, regression, and generation, implemented over 50 clinical scenarios. They benchmark 23 LVLMs, revealing strong image-level diagnostic performance but persistent challenges in spatial reasoning and clinical language generation, with domain-specific models offering improvements in reasoning tasks. The work provides a rigorous, reproducible testbed and a data-driven framework (U2-Score) to guide future ultrasound-oriented LVLM development and clinical deployment considerations.

Abstract

Ultrasound is a widely used imaging modality critical to global healthcare, yet its interpretation remains challenging because image quality varies with the operator, noise, and the anatomical structure being imaged. Although large vision-language models (LVLMs) have demonstrated impressive multimodal capabilities across natural and medical domains, their performance on ultrasound remains largely unexplored. We introduce U2-BENCH, the first comprehensive benchmark to evaluate LVLMs on ultrasound understanding across classification, detection, regression, and text generation tasks. U2-BENCH aggregates 7,241 cases spanning 15 anatomical regions and defines 8 clinically inspired tasks, such as diagnosis, view recognition, lesion localization, clinical value estimation, and report generation, across 50 ultrasound application scenarios. We evaluate 23 state-of-the-art LVLMs, both open- and closed-source, general-purpose and medical-specific. Our results reveal strong performance on image-level classification, but persistent challenges in spatial reasoning and clinical language generation. U2-BENCH establishes a rigorous and unified testbed to assess and accelerate LVLM research in the uniquely multimodal domain of medical ultrasound imaging.

Paper Structure

This paper contains 58 sections, 2 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: Examples of the 8 benchmark tasks in U2-BENCH across diverse anatomical regions. Each callout, consisting of the question prompt, expected output format, and sample output, highlights a representative ultrasound application scenario of the corresponding task. Tasks involve disease diagnosis (DD), view recognition and assessment (VRA), lesion localization (LL), organ detection (OD), keypoint detection (KD), clinical value estimation (CVE), report generation (RG) and caption generation (CG).
  • Figure 2: Distribution of benchmark tasks across 15 anatomical regions in U2-BENCH. The colored boxes next to each anatomy name indicate the benchmark tasks available for that anatomy, with each color corresponding to one of the eight core tasks (legend shown on the right). The blue bar represents the total number of samples for each anatomical region, with its length proportional to the sample count. Multiple tasks may share samples from the same anatomical region, depending on annotation availability and clinical relevance.
  • Figure 3: Overview of the U2-BENCH construction pipeline. The benchmark is built through three stages: (1) data gathering from 40 licensed ultrasound datasets spanning 15 anatomical regions, (2) task definition across 8 clinically inspired tasks grouped into four core capabilities: classification, detection, regression, and text generation, (3) data preprocessing, including annotation standardization, metadata unification, image/frame selection, and quality verification. This unified pipeline ensures benchmark consistency and clinical relevance across diverse ultrasound scenarios.
  • Figure 4: Ultrasound image for Diagnosis Task 40: case001273