UMind-VL: A Generalist Ultrasound Vision-Language Model for Unified Grounded Perception and Comprehensive Interpretation

Dengbo Chen; Ziwei Zhao; Kexin Zhang; Shishuang Zhao; Junjie Hou; Yaqian Wang; Nianxi Liao; Anlan Sun; Fei Gao; Jia Ding; Yuhang Liu; Dong Wang

TL;DR

This work introduces UMind-VL, a unified ultrasound foundation model that bridges pixel-level Ultrasound Grounded Perception with Ultrasound Comprehensive Interpretation. It pairs a lightweight Dynamic Convolutional Mask Decoder with a token-based grounding scheme and the large-scale UMind-DS dataset (1.2 million image-text pairs across 16 anatomical regions) to support segmentation, detection, keypoint localization, and diagnostic reasoning within a single framework. On benchmarks for these four task types, UMind-VL matches or surpasses specialist models and significantly outperforms other generalist multimodal models, while remaining robust on out-of-distribution data. The approach is aimed at clinical ultrasound workflows, where it can deliver coherent, grounded, and interpretable multimodal reasoning in real-world settings.

Abstract

Despite significant strides in medical foundation models, the ultrasound domain lacks a comprehensive solution capable of bridging low-level Ultrasound Grounded Perception (e.g., segmentation, localization) and high-level Ultrasound Comprehensive Interpretation (e.g., diagnosis, reasoning). To close this gap, we propose UMind-VL, a unified foundation model designed to combine pixel-level structural understanding with complex clinical reasoning. We first introduce UMind-DS, a large-scale multimodal dataset comprising 1.2 million ultrasound image-text pairs across 16 anatomical regions, which enriches standard data with pixel-level annotations and clinician-validated rationales. Architecturally, UMind-VL incorporates a lightweight Dynamic Convolutional Mask Decoder that generates masks via dynamic kernels conditioned on LLM outputs. This design, combined with task-specific tokens, unifies segmentation, detection, geometric measurement, and diagnosis tasks within a single framework. Extensive evaluations demonstrate that UMind-VL significantly outperforms existing generalist multimodal models and achieves performance on par with, or superior to, state-of-the-art specialist models across segmentation, detection, keypoint localization, and diagnostic reasoning benchmarks, while maintaining strong generalization ability. Figure 1 illustrates the capabilities of UMind-VL.
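
The abstract does not describe the internals of the Dynamic Convolutional Mask Decoder, so the following is only a minimal sketch of the general idea of a dynamic kernel conditioned on an LLM output, not the authors' implementation. It assumes a single 1x1 kernel, a single linear projection from the hidden state of a hypothetical segmentation token to the kernel parameters, and a sigmoid threshold of 0.5. All names (`project`, `dynamic_conv_mask`, `seg_hidden`) are illustrative.

```python
import math


def project(hidden, weight, bias):
    """Linear layer mapping an LLM hidden state to kernel parameters (assumed design)."""
    return [sum(w * h for w, h in zip(row, hidden)) + b
            for row, b in zip(weight, bias)]


def dynamic_conv_mask(features, seg_hidden, weight, bias, threshold=0.5):
    """Predict a binary mask with a 1x1 kernel generated from `seg_hidden`.

    features   : C x H x W nested lists of image features
    seg_hidden : D-dim hidden state of a hypothetical segmentation token
    weight     : (C + 1) x D projection matrix (C kernel weights plus one bias)
    bias       : (C + 1) projection biases
    """
    channels = len(features)
    params = project(seg_hidden, weight, bias)
    kernel, kernel_bias = params[:channels], params[channels]
    height, width = len(features[0]), len(features[0][0])
    mask = []
    for y in range(height):
        row = []
        for x in range(width):
            # 1x1 convolution: weighted sum over channels at this pixel.
            logit = kernel_bias + sum(
                kernel[c] * features[c][y][x] for c in range(channels)
            )
            prob = 1.0 / (1.0 + math.exp(-logit))
            row.append(1 if prob > threshold else 0)
        mask.append(row)
    return mask


# Toy inputs: 2 feature channels on a 2x2 grid, 2-dim hidden states.
features = [
    [[1.0, 0.0], [0.0, 1.0]],  # channel 0 fires on the main diagonal
    [[0.0, 1.0], [1.0, 0.0]],  # channel 1 fires on the anti-diagonal
]
weight = [[1.0, 0.0], [0.0, 1.0], [0.0, 0.0]]
bias = [0.0, 0.0, 0.0]

mask_a = dynamic_conv_mask(features, [4.0, -4.0], weight, bias)
mask_b = dynamic_conv_mask(features, [-4.0, 4.0], weight, bias)
print(mask_a)  # [[1, 0], [0, 1]]
print(mask_b)  # [[0, 1], [1, 0]]
```

The same feature map yields two different masks because the kernel is produced per query from the token's hidden state rather than being a fixed learned weight. This is the property that lets one decoder serve many targets.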
