Auto-US: An Ultrasound Video Diagnosis Agent Using Video Classification Framework and LLMs

Yuezhe Yang, Yiyue Guo, Wenjie Cai, Qingqing Ruan, Siying Wang, Xingbo Dong, Zhe Jin, Yong Dai

TL;DR

Auto-US addresses the challenge of ultrasound video diagnosis by introducing a multimodal agent that fuses ultrasound video classification with clinical text reasoning. The authors construct the CUV Dataset by integrating public ultrasound video sources and develop CTU-Net, a three-path CNN-Transformer architecture that jointly models spatial, temporal, and frequency information to achieve state-of-the-art accuracy ($86.73\%$) on multi-disease ultrasound videos. The system further integrates Large Language Models to generate clinically meaningful diagnostic suggestions, validated through case studies and an evaluation framework that blends expert judgment with METEOR-based metrics. Together, these components show potential for improved diagnostic efficiency and decision support in real-world ultrasound applications, and the code and data are publicly available. The work highlights both the promise of multimodal AI in ultrasound and the need for larger, more diverse datasets and richer pathology integration to reach broader clinical adoption.

Abstract

AI-assisted ultrasound video diagnosis presents new opportunities to enhance the efficiency and accuracy of medical imaging analysis. However, existing research remains limited in terms of dataset diversity, diagnostic performance, and clinical applicability. In this study, we propose Auto-US, an intelligent diagnosis agent that integrates ultrasound video data with clinical diagnostic text. To support this, we constructed the CUV Dataset of 495 ultrasound videos spanning five categories and three organs, aggregated from multiple open-access sources. We developed CTU-Net, which achieves state-of-the-art performance in ultrasound video classification, reaching an accuracy of 86.73%. Furthermore, by incorporating large language models, Auto-US is capable of generating clinically meaningful diagnostic suggestions. The final diagnostic scores for each case exceeded 3 out of 5 and were validated by professional clinicians. These results demonstrate the effectiveness and clinical potential of Auto-US in real-world ultrasound applications. Code and data are available at: https://github.com/Bean-Young/Auto-US.

Paper Structure

This paper contains 23 sections, 18 equations, 5 figures, 4 tables.

Figures (5)

  • Figure 1: Structure and Composition of the CUV Dataset. (a) Overview of sample categories in the CUV Dataset. (b) Pie chart of data distribution in the CUV Dataset.
  • Figure 2: AUC Distributions and Ablation Study of CTU-Net. (a) Radar chart of AUC across categories for different models. (b) Category-wise AUC comparison from CTU-Net ablation experiments.
  • Figure 3: Flowchart for constructing the CUV Dataset.
  • Figure 4: Architecture diagram of our ultrasound video classification network.
  • Figure 5: Workflow diagram of Auto-US Agent.