Ultrasound-QBench: Can LLMs Aid in Quality Assessment of Ultrasound Imaging?

Hongyi Miao, Jun Jia, Yankun Cao, Yingjie Zhou, Yanwei Jiang, Zhi Liu, Guangtao Zhai

TL;DR

Ultrasound-QBench tackles the challenge of assessing ultrasound image quality in a scalable, zero-shot setting by benchmarking multimodal LLMs (MLLMs) on three tasks: qualitative classification, quantitative scoring, and comparative assessment. It introduces two real-world datasets, IVUSQA and CardiacUltraQA, and a three-task evaluation framework augmented with a Softmax+clustering strategy to improve prediction stability. Across eight MLLMs (seven open-source and one proprietary), results reveal limited zero-shot capabilities for qualitative and quantitative quality assessment, with better performance in relative comparisons but persistent gaps in domain-specific understanding. The work highlights the potential of MLLMs to assist clinical quality assessment while outlining concrete directions (reducing prompt dependency, expanding dataset diversity, and embedding domain knowledge) for practical, robust ultrasound image quality assessment.

Abstract

With the dramatic rise in the volume of ultrasound examinations, low-quality ultrasound imaging has become more common due to variations in operator proficiency and imaging conditions, placing a severe burden on diagnostic accuracy and, in critical cases, risking the need to restart the diagnosis. To assist clinicians in selecting high-quality ultrasound images and ensuring accurate diagnoses, we introduce Ultrasound-QBench, a comprehensive benchmark that systematically evaluates multimodal large language models (MLLMs) on quality assessment tasks for ultrasound images. Ultrasound-QBench establishes two datasets collected from diverse sources: IVUSQA, consisting of 7,709 images, and CardiacUltraQA, containing 3,863 images. These images, which encompass common ultrasound imaging artifacts, are annotated by professional ultrasound experts and classified into three quality levels: high, medium, and low. To better evaluate MLLMs, we decompose the quality assessment task into three dimensions: qualitative classification, quantitative scoring, and comparative assessment. The evaluation of 7 open-source MLLMs and 1 proprietary MLLM demonstrates that MLLMs possess preliminary capabilities for low-level visual tasks in ultrasound image quality classification. We hope this benchmark will inspire the research community to delve deeper into uncovering and enhancing the untapped potential of MLLMs for medical imaging tasks.

Paper Structure (29 sections, 4 figures, 1 table)

Figures (4)

  • Figure 1: The proposed Ultrasound-QBench establishes the first benchmark of MLLM capabilities on ultrasound images, covering qualitative quality assessment, quantitative evaluation, and relative quality comparison.
  • Figure 2: The IVUSQA and CardiacUltraQA datasets and their assessment standard.
  • Figure 3: Combined quality distribution for the IVUSQA and CardiacUltraQA datasets.
  • Figure 4: The proposed softmax-based quality assessment strategy for MLLMs improves upon existing methods by extracting logits for the 'high quality,' 'medium quality,' and 'low quality' categories, rather than directly decoding tokens from the [SCORE TOKEN] position. The strategy predicts labels through a weighted summation and pooling of these logits, followed by a weighted clustering to obtain the final quality rating.
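
The softmax-based scoring idea in the Figure 4 caption can be illustrated with a minimal sketch. It is not the authors' implementation. The level weights (1.0, 0.5, 0.0), the function names, and the example logits are all hypothetical. Because the caption does not specify the weighted clustering step, a simple nearest-anchor assignment stands in for it here.

```python
import math

# Hypothetical numeric weight per quality level (an assumption, not from the paper).
LEVEL_WEIGHTS = {"high": 1.0, "medium": 0.5, "low": 0.0}


def softmax(logits):
    """Turn the logits of the three quality tokens into probabilities."""
    peak = max(logits.values())  # subtract the max for numerical stability
    exps = {level: math.exp(value - peak) for level, value in logits.items()}
    total = sum(exps.values())
    return {level: e / total for level, e in exps.items()}


def quality_score(logits):
    """Weighted sum of the level probabilities, giving a score in [0, 1]."""
    probs = softmax(logits)
    return sum(LEVEL_WEIGHTS[level] * p for level, p in probs.items())


def nearest_level(score):
    """Stand-in for the clustering step: pick the level whose weight is closest."""
    return min(LEVEL_WEIGHTS, key=lambda level: abs(LEVEL_WEIGHTS[level] - score))


# Illustrative logits, as if read from the model at the score-token position.
example_logits = {"high": 2.0, "medium": 1.0, "low": 0.0}
score = quality_score(example_logits)
label = nearest_level(score)
print(f"score={score:.3f}, label={label}")
```

Reading a continuous score from the logits, rather than decoding a single token, lets near-ties between adjacent levels show up as intermediate scores. This is one plausible reason such a strategy would stabilise predictions.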