EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance
Yang Yue, Yulin Wang, Haojun Jiang, Pan Liu, Shiji Song, Gao Huang
TL;DR
EchoWorld addresses echocardiography probe guidance by learning a motion-aware world model that jointly encodes cardiac anatomy and motion-induced visual dynamics. It first pre-trains a cardiac world model on spatial and motion prediction tasks within a JEPA framework, then fine-tunes with a motion-aware attention mechanism that uses historical visual-motion data to predict the probe movements toward ten clinically relevant planes. Trained on more than one million ultrasound images from over 200 routine scans, it achieves lower guidance error than a range of visual backbones and guidance frameworks under both single-frame and sequential evaluation, and qualitative visualizations indicate meaningful attention patterns and learned plane semantics. By combining world-model representations with motion-conditioned attention, EchoWorld offers a representation-learning approach for medical ultrasound that fuses anatomy with the dynamics of image acquisition, with potential applications in AI-assisted or autonomous scanning and broader access to cardiac care.
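To make the idea of motion-conditioned attention over a scan history concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes that the current frame's feature acts as the query, that each historical frame feature is fused with an embedding of the relative probe motion between that frame and the current one, and that probe motion is a 6-dimensional vector. All names (`motion_aware_attention`, `W_m`, `W_q`, `W_k`, `W_v`, `W_out`), the dimensions, and the random weights are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def motion_aware_attention(cur_feat, hist_feats, rel_motions, params):
    """Attend from the current frame to motion-tagged historical frames.

    cur_feat:    (d,)   feature of the current frame
    hist_feats:  (T, d) features of T historical frames
    rel_motions: (T, 6) assumed 6-DoF motion from each past frame to the current one
    Returns predicted movements of shape (10, 6) and attention weights of shape (T,).
    """
    # Fuse each historical feature with an embedding of its relative probe motion.
    tokens = hist_feats + rel_motions @ params["W_m"]      # (T, d)
    q = cur_feat @ params["W_q"]                           # (d,)
    k = tokens @ params["W_k"]                             # (T, d)
    v = tokens @ params["W_v"]                             # (T, d)
    weights = softmax(k @ q / np.sqrt(q.shape[0]))         # (T,)
    fused = cur_feat + weights @ v                         # residual fusion, (d,)
    # One 6-dimensional movement per target plane (ten planes, as in the text).
    pred = (fused @ params["W_out"]).reshape(10, 6)
    return pred, weights

# Toy usage with random, untrained weights (illustrative only).
rng = np.random.default_rng(0)
d, T = 16, 5
params = {
    "W_m": rng.normal(scale=0.1, size=(6, d)),
    "W_q": rng.normal(scale=0.1, size=(d, d)),
    "W_k": rng.normal(scale=0.1, size=(d, d)),
    "W_v": rng.normal(scale=0.1, size=(d, d)),
    "W_out": rng.normal(scale=0.1, size=(d, 60)),
}
pred, weights = motion_aware_attention(
    rng.normal(size=d), rng.normal(size=(T, d)), rng.normal(size=(T, 6)), params
)
```

In this sketch the attention weights indicate how much each past frame, tagged with how the probe moved since it was acquired, contributes to the current prediction. A real model would use learned multi-head attention over patch tokens rather than single pooled vectors.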
Abstract
Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at https://github.com/LeapLabTHU/EchoWorld.
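The two pre-training tasks described above (predicting masked anatomical regions and simulating the visual outcome of a probe adjustment) can be illustrated with a small latent-prediction sketch in NumPy. This is a toy under stated assumptions rather than the released code: encoders are stand-in linear maps with a `tanh`, the target encoder is a frozen copy of the context encoder (JEPA-style methods commonly use a gradient-free target encoder), the loss is mean squared error in latent space, and probe motion is assumed to be a 6-dimensional vector. All names and sizes are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
N, D_IN, D = 12, 32, 16   # hypothetical: patches per frame, patch dim, latent dim

W_ctx = rng.normal(scale=0.1, size=(D_IN, D))   # context encoder (stand-in)
W_tgt = W_ctx.copy()                            # target encoder, frozen copy (assumption)
W_pred = rng.normal(scale=0.1, size=(D, D))     # predictor
W_motion = rng.normal(scale=0.1, size=(6, D))   # embeds an assumed 6-DoF probe motion
pos = rng.normal(scale=0.1, size=(N, D))        # per-patch positional embeddings

def encode(patches, W):
    # Stand-in encoder: one linear layer followed by tanh.
    return np.tanh(patches @ W)

def spatial_loss(patches, mask):
    """Predict the latents of masked patches from the visible patches."""
    ctx = encode(patches[~mask], W_ctx).mean(axis=0)   # pooled visible context, (D,)
    target = encode(patches[mask], W_tgt)              # (n_masked, D)
    pred = (ctx + pos[mask]) @ W_pred                  # one prediction per masked position
    return float(np.mean((pred - target) ** 2))

def motion_loss(patches_a, patches_b, motion_ab):
    """Predict the latent of frame B from frame A and the probe motion A -> B."""
    z_a = encode(patches_a, W_ctx).mean(axis=0)        # (D,)
    z_b = encode(patches_b, W_tgt).mean(axis=0)        # (D,)
    pred = (z_a + motion_ab @ W_motion) @ W_pred
    return float(np.mean((pred - z_b) ** 2))

# Toy data: two frames from the same scan and the probe motion between them.
frame_a = rng.normal(size=(N, D_IN))
frame_b = rng.normal(size=(N, D_IN))
motion_ab = rng.normal(size=6)
mask = np.zeros(N, dtype=bool)
mask[:4] = True                                        # mask the first four patches

l_spatial = spatial_loss(frame_a, mask)
l_motion = motion_loss(frame_a, frame_b, motion_ab)
total = l_spatial + l_motion                           # equal weighting is an assumption
```

Both losses are computed in latent space rather than pixel space, which is the defining choice of a JEPA-style objective. The equal weighting of the two terms here is arbitrary and only serves the illustration.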
