Dolphin v1.0 Technical Report

Taohan Weng, Kaibing Hu, Henan Liu, Siya Liu, Xiaoyang Liu, Zhenyu Liu, Jiren Ren, Boyan Wang, Boyang Wang, Yiyu Wang, Yalun Wu, Chaoran Yan, Kaiwen Yan, Jinze Yu, Chi Zhang, Duo Zhang, Haoyun Zheng, Xiaoqing Guo, Jacques Souquet, Hongcheng Guo, Anjie Le

TL;DR

The paper tackles the challenge of applying multimodal foundation models to ultrasound, which is characterized by operator variance, real-time dynamics, and noise. It presents Dolphin v1.0 (V1) and its reasoning-augmented variant Dolphin R1, built on a 2-million-sample ultrasound dataset and a three-stage training pipeline (domain-specialized pretraining, instruction-driven alignment, reinforcement-based refinement) to unify tasks across classification, detection, regression, and generation. A key contribution is the Ultrasound Answer Reward (UAR) and a Bayesian reasoning framework for cross-domain reasoning, enabling deep reasoning and interpretability. On U2-Bench, Dolphin R1 achieves a U2-score of $0.5835$, significantly surpassing prior models, and ablations show that deep reasoning boosts diagnostic accuracy and interpretability, supporting practical deployment in clinical workflows.

Abstract

Ultrasound is crucial in modern medicine, but operator dependence, image noise, and real-time scanning hinder its integration with AI. Large multimodal models excel in other areas of medical imaging yet struggle with the complexities of ultrasound. To address this, we introduce Dolphin v1.0 (V1) and its reasoning-augmented version, Dolphin R1, the first large-scale multimodal ultrasound foundation models to unify diverse clinical tasks in a single vision-language framework. To tackle ultrasound variability and noise, we curated a 2-million-scale multimodal dataset that combines textbook knowledge, public data, synthetic samples, and general corpora, supporting robust perception, generalization, and clinical adaptability. The Dolphin series employs a three-stage training strategy: domain-specialized pretraining, instruction-driven alignment, and reinforcement-based refinement. Dolphin v1.0 delivers reliable performance in classification, detection, regression, and report generation. Dolphin R1 enhances diagnostic inference, reasoning transparency, and interpretability through reinforcement learning with ultrasound-specific rewards. Evaluated on U2-Bench across eight ultrasound tasks, Dolphin R1 achieves a U2-score of 0.5835, nearly twice that of the second-best model (0.2968), setting a new state of the art. Dolphin v1.0 also performs competitively, validating the unified framework. Comparisons show that reasoning-enhanced training significantly improves diagnostic accuracy, consistency, and interpretability, highlighting its importance for high-stakes medical AI.
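
The margin between the two U2-scores quoted in the abstract can be checked with a quick calculation. The sketch below uses only the two reported numbers; the variable names are illustrative and not taken from the paper.

```python
# U2-scores as reported in the abstract (no other values are assumed).
dolphin_r1 = 0.5835   # Dolphin R1
second_best = 0.2968  # second-best model on U2-Bench

# Relative and absolute margins between the two scores.
ratio = dolphin_r1 / second_best
absolute_gap = dolphin_r1 - second_best

print(f"ratio: {ratio:.3f}")                # ratio: 1.966
print(f"absolute gap: {absolute_gap:.4f}")  # absolute gap: 0.2867
```

The ratio comes out slightly below 2, so the lead is better described as just under a twofold improvement.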

Paper Structure

This paper contains 20 sections, 1 equation, 29 figures, 1 table.

Figures (29)

  • Figure 1: Data types of the Dolphin Ultrasound Large-Scale Dataset
  • Figure 2: Overview of the Dolphin Ultrasound Diagnostic Data Pipeline. The framework includes data collection, expert-guided filtering, and standardization via the Dolphin Ultrasound Data Protocol (DUDP), ensuring consistency across multimodal sources. Tasks are unified into four categories—classification, detection, regression, and generation—covering diverse clinical scenarios. The dataset distribution demonstrates balanced coverage across major anatomical regions, supporting robust model generalization.
  • Figure 3: Examples of constructed ultrasound question–answer pairs. The dataset integrates diverse clinical tasks, including fetal abdominal view assessment, lung ultrasound severity scoring, and soft tissue lesion evaluation, with corresponding expert-verified answers. These QA pairs provide structured supervision signals that enhance both low-level image understanding and high-level diagnostic reasoning in Dolphin models.
  • Figure 4: Training pipeline of the Dolphin series model. The process consists of three key stages: domain-specific pretraining on 2M multimodal medical data, instruction finetuning with 10.3k curated samples to enhance alignment with clinical tasks, and self-play reinforcement learning on 60.7k interactions to further improve reasoning and decision-making ability.
  • Figure 5: Model performance on General & Medical Bench
  • ...and 24 more figures
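
The three training stages described in the Figure 4 caption can be summarized as a small data structure. This is only an illustrative sketch: the `Stage` class and its field names are hypothetical, and the sample counts are the rounded figures quoted in the caption (2M, 10.3k, 60.7k), not exact dataset sizes.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Stage:
    """One training stage; a hypothetical container, not the authors' code."""
    name: str
    data: str
    approx_samples: int  # rounded count as quoted in the Figure 4 caption


# Stage order and sizes follow the Figure 4 caption.
PIPELINE = [
    Stage("domain-specific pretraining", "multimodal medical data", 2_000_000),
    Stage("instruction finetuning", "curated samples", 10_300),
    Stage("self-play reinforcement learning", "interactions", 60_700),
]

for index, stage in enumerate(PIPELINE, start=1):
    print(f"Stage {index}: {stage.name} on ~{stage.approx_samples:,} {stage.data}")
```

Laid out this way, the scale difference is easy to see: the pretraining corpus is roughly two orders of magnitude larger than the instruction and reinforcement sets.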