EchoWorld: Learning Motion-Aware World Models for Echocardiography Probe Guidance

Yang Yue, Yulin Wang, Haojun Jiang, Pan Liu, Shiji Song, Gao Huang

TL;DR

EchoWorld tackles the challenge of guiding echocardiography probes by learning motion-aware world models that jointly encode cardiac anatomy and motion-induced visual dynamics. It first pre-trains a cardiac world model using spatial and motion tasks within a JEPA framework, then fine-tunes with a motion-aware attention mechanism that ingests historical visual-motion data to predict probe movements toward ten clinically relevant planes. The approach is validated on a large echocardiography dataset, showing superior accuracy in both single-frame and sequential guidance tasks compared with diverse baselines, and enables visualizations that confirm meaningful attention and learned plane semantics. By integrating world-model representations with motion-conditioned attention, EchoWorld advances embodied medical imaging and holds promise for AI-assisted or autonomous ultrasound scanning. The work provides a principled representation-learning pathway for medical ultrasound that fuses anatomy with dynamic imaging processes, with practical implications for broader access to cardiac care.

Abstract

Echocardiography is crucial for cardiovascular disease detection but relies heavily on experienced sonographers. Echocardiography probe guidance systems, which provide real-time movement instructions for acquiring standard plane images, offer a promising solution for AI-assisted or fully autonomous scanning. However, developing effective machine learning models for this task remains challenging, as they must grasp heart anatomy and the intricate interplay between probe motion and visual signals. To address this, we present EchoWorld, a motion-aware world modeling framework for probe guidance that encodes anatomical knowledge and motion-induced visual dynamics, while effectively leveraging past visual-motion sequences to enhance guidance precision. EchoWorld employs a pre-training strategy inspired by world modeling principles, where the model predicts masked anatomical regions and simulates the visual outcomes of probe adjustments. Built upon this pre-trained model, we introduce a motion-aware attention mechanism in the fine-tuning stage that effectively integrates historical visual-motion data, enabling precise and adaptive probe guidance. Trained on more than one million ultrasound images from over 200 routine scans, EchoWorld effectively captures key echocardiographic knowledge, as validated by qualitative analysis. Moreover, our method significantly reduces guidance errors compared to existing visual backbones and guidance frameworks, excelling in both single-frame and sequential evaluation protocols. Code is available at https://github.com/LeapLabTHU/EchoWorld.
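The pre-training idea in the abstract, simulating the visual outcome of a probe adjustment, can be illustrated with a small feature-space prediction loss. This is only a minimal sketch under stated assumptions, not the authors' implementation: the linear `encode` and `predict_target` functions, the 6-dimensional motion vector, and all parameter names are hypothetical stand-ins chosen for illustration.

```python
import numpy as np


def encode(image, w_enc):
    """Hypothetical stand-in encoder: flatten the frame, one linear layer, tanh."""
    return np.tanh(image.reshape(-1) @ w_enc)


def predict_target(context_feat, motion, w_feat, w_motion):
    """Hypothetical stand-in predictor conditioned on the probe motion."""
    return context_feat @ w_feat + motion @ w_motion


def motion_prediction_loss(frame_ctx, frame_tgt, motion, params):
    """Mean squared error between predicted and actual target features.

    frame_ctx: frame before the probe adjustment
    frame_tgt: frame after the probe adjustment
    motion:    relative probe motion between the two frames (assumed 6-D here)
    """
    feat_ctx = encode(frame_ctx, params["enc"])
    feat_tgt = encode(frame_tgt, params["enc"])  # prediction target in feature space
    feat_pred = predict_target(feat_ctx, motion, params["feat"], params["motion"])
    return float(np.mean((feat_pred - feat_tgt) ** 2))


rng = np.random.default_rng(0)
feat_dim, motion_dim = 16, 6
params = {
    "enc": rng.normal(scale=0.1, size=(64, feat_dim)),
    "feat": rng.normal(scale=0.1, size=(feat_dim, feat_dim)),
    "motion": rng.normal(scale=0.1, size=(motion_dim, feat_dim)),
}
frame_before = rng.normal(size=(8, 8))
frame_after = rng.normal(size=(8, 8))
probe_motion = rng.normal(size=motion_dim)
loss = motion_prediction_loss(frame_before, frame_after, probe_motion, params)
```

The loss is computed between features rather than pixels, which is the property the sketch is meant to show; a real system would use learned deep encoders and a training loop that this example omits.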

Paper Structure

This paper contains 21 sections, 18 equations, 10 figures, 3 tables, 1 algorithm.

Figures (10)

  • Figure 1: Overview of cardiac ultrasound and the probe guidance task. (a) The ultrasound probe captures cross-sectional views of the heart, with variations in probe position and orientation corresponding to different anatomical structures. (b) During the ultrasound scanning process, the sonographer maneuvers the probe on the patient's chest, continuously adjusting its position and orientation based on real-time visual feedback. (c) A probe guidance system can potentially automate the scanning process by predicting the necessary probe movements to reach a target view, utilizing historical visual-motion data.
  • Figure 2: Overview of the proposed framework. Left: We pre-train a cardiac world model to capture ultrasound knowledge through spatial and motion modeling tasks. Right: The pre-trained model is fine-tuned for probe guidance, incorporating a motion-aware attention mechanism to effectively integrate visual-motion features.
  • Figure 3: Illustration of our dataset and task. Top-left: We collect expert demonstration data where the sonographer controls a robot arm with a probe, recording both image frames and probe motion synchronously. Remaining figure: The ten standard planes targeted for acquisition. Figures are adapted from [mitchell2019guidelines, jiang2024cardiac].
  • Figure 4: Illustration of the world modeling tasks. (a) A basic world modeling framework [lecun2022path], where the task is to predict the target $y$ from context $x$ in feature space, using a latent variable $z$ encoding their relationship. (b) The spatial world modeling task, which recovers masked anatomical structures. (c) The motion world modeling task, which predicts visual changes in the context based on probe motion.
  • Figure 5: The probe guidance pipeline. Given a sequence of historical visual-motion pairs, we first extract features using the pre-trained visual and motion encoders. These features are then integrated via a motion-aware attention mechanism and projected to the final guidance output.
  • ...and 5 more figures
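The probe guidance pipeline described in the Figure 5 caption (encode historical visual-motion pairs, fuse them with attention, project to a guidance output) might be sketched as follows. This is a hedged illustration only: the additive fusion of visual and motion features, the single-head attention with the current frame as query, the 6-degree-of-freedom output per plane, and every function and parameter name are assumptions, since the listing here does not give the actual mechanism.

```python
import numpy as np


def softmax(x, axis=-1):
    """Numerically stable softmax."""
    shifted = x - x.max(axis=axis, keepdims=True)
    exp = np.exp(shifted)
    return exp / exp.sum(axis=axis, keepdims=True)


def guide_probe(visual_feats, motion_feats, w_q, w_k, w_v, w_out,
                num_planes=10, dof=6):
    """Sketch of attention over a history of visual-motion features.

    visual_feats: (T, D) features of T frames, the last row being the current frame
    motion_feats: (T, D) encoded probe motion of each frame relative to the
                  current frame (assumed convention; zero for the current frame)
    Returns a (num_planes, dof) guidance array and the (T,) attention weights.
    """
    tokens = visual_feats + motion_feats           # assumed additive fusion, (T, D)
    query = tokens[-1:] @ w_q                      # current frame as query, (1, D)
    keys = tokens @ w_k                            # (T, D)
    values = tokens @ w_v                          # (T, D)
    scores = query @ keys.T / np.sqrt(keys.shape[-1])  # scaled dot product, (1, T)
    attn = softmax(scores, axis=-1)                # weights over the history
    context = attn @ values                        # (1, D)
    guidance = (context @ w_out).reshape(num_planes, dof)
    return guidance, attn[0]


rng = np.random.default_rng(0)
seq_len, feat_dim = 4, 16
visual = rng.normal(size=(seq_len, feat_dim))
motion = rng.normal(size=(seq_len, feat_dim))
motion[-1] = 0.0  # the current frame has no motion relative to itself
w_q, w_k, w_v = (rng.normal(scale=0.1, size=(feat_dim, feat_dim)) for _ in range(3))
w_out = rng.normal(scale=0.1, size=(feat_dim, 10 * 6))
guidance, attn = guide_probe(visual, motion, w_q, w_k, w_v, w_out)
```

Here `guidance` holds one predicted movement per target plane, matching the ten standard planes named in the Figure 3 caption, and `attn` shows how much each historical frame contributed to the prediction.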