ECHOPulse: ECG controlled echocardiograms video generation

Yiwei Li; Sekeun Kim; Zihao Wu; Hanqi Jiang; Yi Pan; Pengfei Jin; Sifan Song; Yucheng Shi; Tianming Liu; Quanzheng Li; Xiang Li

TL;DR

ECHOPulse tackles the challenge of producing controllable, high-quality echocardiogram videos without expert conditional prompts by conditioning video generation on ECG signals. It introduces a fast, token-based pipeline using VQ-VAE video tokenization and a masked generative transformer guided by ECG embeddings, with LFQ and LoRA to improve efficiency and domain adaptation. The approach achieves state-of-the-art results on multiple datasets in both qualitative and quantitative metrics, and demonstrates the potential for zero-shot generalization to real-world ECG inputs (e.g., wearable devices) and extension to other cardiac-imaging modalities. This work has practical implications for scalable synthetic data generation, real-time clinical monitoring, and broader modality-generalizable video synthesis in medical imaging.

Abstract

Echocardiography (ECHO) is essential for cardiac assessments, but its video quality and interpretation rely heavily on manual expertise, leading to inconsistent results from clinical and portable devices. ECHO video generation offers a solution by improving automated monitoring through synthetic data and generating high-quality videos from routine health data. However, existing models often face high computational costs and slow inference, and rely on complex conditional prompts that require expert annotations. To address these challenges, we propose ECHOPulse, an ECG-conditioned ECHO video generation model. ECHOPulse introduces two key advancements: (1) it accelerates ECHO video generation by leveraging VQ-VAE tokenization and masked visual token modeling for fast decoding, and (2) it conditions on readily accessible ECG signals, which are highly coherent with ECHO videos, bypassing complex conditional prompts. To the best of our knowledge, this is the first work to use time-series prompts like ECG signals for ECHO video generation. ECHOPulse not only enables controllable synthetic ECHO data generation but also provides updated cardiac function information for disease monitoring and prediction beyond ECG alone. Evaluations on three public and private datasets demonstrate state-of-the-art performance in ECHO video generation across both qualitative and quantitative measures. Additionally, ECHOPulse can be easily generalized to other modality generation tasks, such as cardiac MRI, fMRI, and 3D CT generation. A demo is available at https://github.com/levyisthebest/ECHOPulse_Prelease.
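The "masked visual token modeling for fast decoding" mentioned in the abstract can be illustrated with a minimal sketch of iterative parallel decoding in the style of masked generative transformers. This is not the authors' implementation: `toy_predictor` is a hypothetical random stand-in for the ECG-conditioned transformer, and the cosine schedule, `MASK` id, vocabulary size, and step count are all assumptions chosen for illustration.

```python
import math
import random

MASK = -1  # hypothetical id marking a token that is still unknown


def toy_predictor(tokens, ecg_embedding, vocab_size, rng):
    """Hypothetical stand-in for the ECG-conditioned transformer.

    Returns a (token_id, confidence) guess for every position. A real model
    would compute these from the visible tokens and the ECG embedding.
    """
    return [(rng.randrange(vocab_size), rng.random()) for _ in tokens]


def masked_decode(num_tokens, ecg_embedding, vocab_size=1024, steps=8, seed=0):
    """Fill a fully masked token sequence in a fixed number of parallel steps."""
    rng = random.Random(seed)
    tokens = [MASK] * num_tokens
    for step in range(1, steps + 1):
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        guesses = toy_predictor(tokens, ecg_embedding, vocab_size, rng)
        # Assumed cosine schedule: how many tokens stay masked after this step.
        keep_masked = int(num_tokens * math.cos(math.pi / 2 * step / steps))
        n_commit = max(1, len(masked) - keep_masked)
        # Commit the most confident guesses among the still-masked positions.
        masked.sort(key=lambda i: guesses[i][1], reverse=True)
        for i in masked[:n_commit]:
            tokens[i] = guesses[i][0]
    return tokens


video_tokens = masked_decode(num_tokens=64, ecg_embedding=[0.0] * 8)
```

The point of the sketch is the cost profile: the whole sequence is filled in `steps` forward passes rather than one pass per token, which is why this family of decoders is faster than token-by-token autoregressive sampling.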

Paper Structure (27 sections, 11 equations, 7 figures, 5 tables)

Introduction
RELATED WORK
Visual Tokenization
Video Generation
ECHO Video Generation
METHODS
Video Tokenization
Model architecture:
Loss design:
Alignment between Video and ECG Tokens
ECG encoder:
Aligning ECG and video token sequences for video prediction:
Optimization:
Video Generation
Progressive video generation with auto-regressive extrapolation:
...and 12 more sections

Figures (7)

Figure 1: The ECHOPulse pipeline. Training has two steps. (a) The first step trains the video tokenizer on a natural video dataset and then fine-tunes it on the public ECHO video dataset. (b) The second step trains the transformer on input tokens produced by the ECG foundation model and by the video tokenizer pretrained in the first step. (c) In the video generation procedure that follows, ECHOPulse accepts input with or without a conditional image. The empty tokens are reconstructed by the frozen transformer trained in the second step, guided by ECG signals. ECHOPulse can generate continuous long videos by sequentially shifting the token sequence and integrating new ECG inputs.
Figure 2: Video generation example of ECHOPulse. The inputs to ECHOPulse consist of a conditioning image and an ECG signal. Utilizing these inputs as constraints, ECHOPulse generates corresponding videos. The RSTP waves represent the four phases of the ECG. The R wave corresponds to end-diastole (ED), which is the frame with the largest ventricular segmentation area, while the T wave corresponds to end-systole (ES), the frame with the smallest ventricular area. In this case the EF of the generated ECHO video is 24.05.
Figure 3: Video generation results from ECHOPulse under different ECG conditions. The first row displays the ground truth, while the second row shows the video generated from the same ECG. The third row illustrates results under a flat ECG. The fourth row depicts a shifted R wave, and the final row, generated from an ECG collected on an Apple Watch, represents the most common scenario. The ECG for Condition 4 was collected directly from the Health app on Apple Watch v9.
Figure 4: A2C video generation example from ECHOPulse.
Figure 5: A2C video generation example from ECHOPulse.
...and 2 more figures
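
The Figure 2 caption ties end-diastole (ED) to the frame with the largest ventricular area, end-systole (ES) to the frame with the smallest, and reports an ejection fraction (EF). A minimal sketch of that bookkeeping follows, using the standard definition EF = (EDV - ESV) / EDV x 100. The function names and the sample numbers are hypothetical, and how volumes are estimated from the segmentation is not specified here, so the volumes are taken as given inputs.

```python
def find_ed_es(areas):
    """Return (ed_index, es_index) from per-frame ventricular areas.

    ED is taken as the frame with the largest area and ES as the frame with
    the smallest, following the Figure 2 caption.
    """
    if not areas:
        raise ValueError("areas must contain at least one frame")
    ed_index = max(range(len(areas)), key=lambda i: areas[i])
    es_index = min(range(len(areas)), key=lambda i: areas[i])
    return ed_index, es_index


def ejection_fraction(edv, esv):
    """Standard EF in percent from end-diastolic and end-systolic volumes."""
    if edv <= 0:
        raise ValueError("end-diastolic volume must be positive")
    return 100.0 * (edv - esv) / edv


# Hypothetical per-frame areas (arbitrary units) and volumes (mL).
ed, es = find_ed_es([30.0, 42.0, 35.0, 20.0, 28.0])
ef = ejection_fraction(edv=100.0, esv=76.0)
```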
