Diagnostic Performance of Universal-Learning Ultrasound AI Across Multiple Organs and Tasks: the UUSIC25 Challenge

Zehui Lin, Luyi Han, Xin Wang, Ying Zhou, Yanming Zhang, Tianyu Zhang, Lingyun Bao, Shandong Wu, Dong Xu, Tao Tan, the UUSIC25 Challenge Consortium

TL;DR

Ultrasound AI has been largely fragmented into organ-/task-specific tools. The UUSIC25 challenge demonstrates that a single general-purpose model can jointly perform multi-organ segmentation and classification with strong accuracy and favorable inference efficiency, even across a fully private, multi-center test set. However, significant generalization gaps emerge when applying models to unseen data, underscoring the need for improved domain generalization and regulatory pathways for multi-task deployment. The work advocates for all-in-one ultrasound AI systems and provides a roadmap for architecture, data strategy, and evaluation to advance clinically robust, cross-organ decision support.

Abstract

IMPORTANCE: Current ultrasound AI remains fragmented into single-task tools, limiting clinical utility compared to versatile modern ultrasound systems.

OBJECTIVE: To evaluate the diagnostic accuracy and efficiency of single general-purpose deep learning models for multi-organ classification and segmentation.

DESIGN: The Universal UltraSound Image Challenge 2025 (UUSIC25) involved developing algorithms on 11,644 images (public/private). Evaluation used an independent, multi-center test set of 2,479 images, including data from a center completely unseen during training to assess generalization.

OUTCOMES: Diagnostic performance (Dice Similarity Coefficient [DSC]; Area Under the Receiver Operating Characteristic Curve [AUC]) and computational efficiency (inference time, GPU memory).

RESULTS: Of 15 valid algorithms, the top model (SMART) achieved a macro-averaged DSC of 0.854 across 5 segmentation tasks and AUC of 0.766 for binary classification. Models showed high capability in segmentation (e.g., fetal head DSC: 0.942) but variability in complex tasks subject to domain shift. Notably, in breast cancer molecular subtyping, the top model's performance dropped from AUC 0.571 (internal) to 0.508 (unseen external center), highlighting generalization challenges.

CONCLUSIONS: General-purpose AI models achieve high accuracy and efficiency across multiple tasks using a single architecture. However, performance degradation on unseen data suggests domain generalization is critical for future clinical deployment.
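The two outcome metrics named in the abstract have standard definitions, which the following minimal Python sketch illustrates. It is not the challenge's evaluation code: the function names, the flattened-mask input format, the empty-mask convention, and the choice of a simple unweighted mean for macro-averaging are all assumptions made for illustration.

```python
from typing import Dict, Sequence


def dice_coefficient(pred: Sequence[int], target: Sequence[int]) -> float:
    """DSC = 2|P ∩ T| / (|P| + |T|) for two flattened binary masks."""
    if len(pred) != len(target):
        raise ValueError("masks must contain the same number of pixels")
    intersection = sum(1 for p, t in zip(pred, target) if p and t)
    total = sum(1 for p in pred if p) + sum(1 for t in target if t)
    # Assumed convention: two empty masks count as a perfect match.
    return 1.0 if total == 0 else 2.0 * intersection / total


def roc_auc(scores: Sequence[float], labels: Sequence[int]) -> float:
    """AUC as the probability that a random positive case scores higher
    than a random negative case (ties count as one half)."""
    pos = [s for s, y in zip(scores, labels) if y == 1]
    neg = [s for s, y in zip(scores, labels) if y == 0]
    if not pos or not neg:
        raise ValueError("AUC needs at least one positive and one negative")
    wins = sum(
        1.0 if p > n else 0.5 if p == n else 0.0
        for p in pos
        for n in neg
    )
    return wins / (len(pos) * len(neg))


def macro_average(per_task: Dict[str, float]) -> float:
    """Unweighted mean over tasks, so each task counts equally
    regardless of how many images it contains."""
    return sum(per_task.values()) / len(per_task)


# Toy inputs, not challenge data.
toy_dsc = dice_coefficient([1, 1, 0, 0], [1, 0, 0, 0])
toy_auc = roc_auc([0.9, 0.8, 0.3, 0.1], [1, 0, 1, 0])
toy_macro = macro_average({"task_a": 0.8, "task_b": 0.9})
```

The pairwise AUC above is quadratic in the number of cases and is meant only to make the definition concrete; a sorted, rank-based implementation would be the usual choice at scale. An AUC of 0.5 corresponds to chance-level ranking, which is the reference point for reading the reported external-center value of 0.508.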

Paper Structure

This paper contains 16 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: Study Flow Diagram. Data collection, source attribution, and stratification logic. The diagram illustrates the explicit separation of data streams: public datasets were utilized exclusively for training to promote generalization (n=10,010), while internal private data were stratified across all sets (n=5,499). Data from the external center (NKI, n=512) served as a strictly held-out test set. Data Sources: Public datasets included BUSI [al2020dataset], BUSIS [zhang2022busis], and BUS-BRA [gomez2024bus] for breast; DDTI [pedraza2015open] for thyroid; Fatty-Liver [byra2018transfer] for liver; KidneyUS [singla2023open] for kidney; Fetal HC [van2018automated] for fetal head; CAMUS [leclerc2019deep] for cardiac; and the Appendix Dataset [marcinkevics_2023_7711412]. Abbreviations: Internal centers in China include Zhejiang Cancer Hospital and Hangzhou First People's Hospital. The external center in the Netherlands refers to the Netherlands Cancer Institute (NKI).
  • Figure 2: Comprehensive Evaluation of Diagnostic Performance and Efficiency. (A) Segmentation performance profiles (Dice Similarity Coefficient, DSC) across five anatomical regions for the top-15 algorithms. (B) Classification performance profiles (Area Under the Curve, AUC) across diverse diagnostic tasks. Note that while top models show consistent segmentation capability in (A), classification performance in (B) varies significantly by task difficulty. (C) Efficiency-Accuracy trade-off. The y-axis represents the composite performance score (higher is better), and the x-axis shows inference time (left is better). Marker size indicates GPU memory usage. A distinct trade-off is visible, with top-ranking algorithms occupying the optimal zone (high accuracy, low resource usage). (D-E) Receiver Operating Characteristic (ROC) curves for two clinically distinct tasks: Breast Malignancy (focal lesion) and Fatty Liver (diffuse texture). Top models maintain high discrimination (AUC > 0.83) in both scenarios. Additional visualizations are available in eFigure 2 and eFigure 3 in the Supplement.
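
The source-based routing described in the Figure 1 caption (public data to training only, internal private data spread across all sets, external-center data held out for testing) can be sketched as follows. This is a hypothetical illustration, not the organizers' procedure: the source labels, the split names, the `val_frac` and `test_frac` values, and the hash-based assignment are all assumptions, and the caption's stratification criteria are not reproduced here.

```python
import hashlib

# Hypothetical source labels for the three data streams in Figure 1.
PUBLIC = "public"
PRIVATE = "internal_private"
EXTERNAL = "external"


def assign_split(
    image_id: str,
    source: str,
    val_frac: float = 0.1,   # assumed fraction, not from the paper
    test_frac: float = 0.2,  # assumed fraction, not from the paper
) -> str:
    """Route one image to 'train', 'val', or 'test' based on its source."""
    if source == PUBLIC:
        # Public datasets are used exclusively for training.
        return "train"
    if source == EXTERNAL:
        # The external center is strictly held out for testing.
        return "test"
    if source != PRIVATE:
        raise ValueError(f"unknown source: {source!r}")
    # Internal private data: a deterministic pseudo-random draw in [0, 1)
    # derived from the image ID, so reruns give the same assignment.
    digest = hashlib.sha256(image_id.encode("utf-8")).hexdigest()
    u = int(digest[:8], 16) / 16 ** 8
    if u < test_frac:
        return "test"
    if u < test_frac + val_frac:
        return "val"
    return "train"


example_public = assign_split("img_0001", PUBLIC)
example_external = assign_split("img_0002", EXTERNAL)
example_private = assign_split("img_0003", PRIVATE)
```

Routing by source before any random draw is what keeps the external center unseen during training, which is the property the generalization analysis depends on. A real pipeline would more likely split internal data at the patient level and stratify by task and label rather than hash individual image IDs.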