Feature-Conditioned Cascaded Video Diffusion Models for Precise Echocardiogram Synthesis

Hadrien Reynaud; Mengyun Qiao; Mischa Dombrowski; Thomas Day; Reza Razavi; Alberto Gomez; Paul Leeson; Bernhard Kainz

TL;DR

The study tackles the challenge of generating clinically useful ultrasound video data from limited inputs by proposing an image-conditioned cascaded diffusion framework that omits text prompts and directly uses a single frame $I_c$ and a clinical parameter $\lambda_c$ to synthesize controllable echocardiogram videos. It demonstrates on the EchoNet-Dynamic dataset that higher temporal resolution via cascaded stages (e.g., 2SCM with temporal upsampling) delivers strong counterfactual performance, with $R^2$ reaching up to $0.93$, representing a substantial improvement over prior sequence-to-sequence methods. The work also shows practical benefits for data augmentation in regression tasks and provides qualitative expert validation, suggesting that the approach can support training, evaluation, and downstream analyses in medical imaging. Overall, this approach points toward foundation-model-like, controllable medical video generation with potential for extension to other organs and conditioning modalities. $R^2$ and related metrics are used to quantify accuracy and alignment with clinical parameters, while image quality is assessed via SSIM, LPIPS, FID, and FVD, underscoring a favorable balance between fidelity and controllability.

Abstract

Image synthesis is expected to provide value for the translation of machine learning methods into clinical practice. Fundamental problems like model robustness, domain transfer, causal modelling, and operator training become approachable through synthetic data. In particular, heavily operator-dependent modalities like ultrasound imaging require robust frameworks for image and video generation. So far, video generation has only been possible by providing input data that is as rich as the output data, e.g., image sequence plus conditioning in, video out. However, clinical documentation is usually scarce, and only single images are reported and stored, so retrospective patient-specific analysis or the generation of rich training data is impossible with current approaches. In this paper, we extend elucidated diffusion models for video modelling to generate plausible video sequences from single images and arbitrary conditioning with clinical parameters. We explore this idea within the context of echocardiograms by looking into the variation of the Left Ventricle Ejection Fraction (LVEF), the most essential clinical metric gained from these examinations. We use the publicly available EchoNet-Dynamic dataset for all our experiments. Our image-to-sequence approach achieves an $R^2$ score of 93%, which is 38 points higher than recently proposed sequence-to-sequence generation methods. Code and models will be available at: https://github.com/HReynaud/EchoDiffusion.
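As a rough illustration of the reported metric, the sketch below computes the coefficient of determination $R^2$ in plain Python. It assumes that $R^2$ compares the LVEF requested as conditioning with the LVEF estimated from the generated video; the function name `r2_score` and the numeric values are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def r2_score(target: Sequence[float], predicted: Sequence[float]) -> float:
    """Coefficient of determination: 1 - SS_res / SS_tot."""
    if len(target) != len(predicted) or len(target) == 0:
        raise ValueError("inputs must be non-empty and of equal length")
    mean_target = sum(target) / len(target)
    # Residual sum of squares: error between target and prediction.
    ss_res = sum((t - p) ** 2 for t, p in zip(target, predicted))
    # Total sum of squares: spread of the targets around their mean.
    ss_tot = sum((t - mean_target) ** 2 for t in target)
    if ss_tot == 0:
        raise ValueError("R^2 is undefined when all targets are identical")
    return 1.0 - ss_res / ss_tot


# Hypothetical LVEF values in percent, for illustration only.
requested_lvef = [30.0, 45.0, 60.0, 75.0]
estimated_lvef = [32.0, 44.0, 58.0, 76.0]
score = r2_score(requested_lvef, estimated_lvef)
print(f"R^2 = {score:.4f}")
```

A score of 1 means the estimated values match the requested ones exactly, while a score near 0 means the estimates explain no more variance than predicting the mean.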

Paper Structure (4 sections, 2 equations, 2 figures, 2 tables)

This paper contains 4 sections, 2 equations, 2 figures, 2 tables.

Introduction
Method
Experiments
Conclusion

Figures (2)

Figure 1: Summarized view of our Model. Inputs (blue): a noised sample $x_i$, a diffusion step $t_i$, one anatomy image $I_c$, and one LVEF $\lambda_c$. Output (red): a slightly denoised version of $x_i$ named $x_{i+1}$. See Appendix Fig. 1 for more details.
Figure 2: Top: Ground truth frames with LVEF 29.3%. Middle: Generated factual frames, with estimated LVEF 27.9%. Bottom: Generated counterfactual frames, with estimated LVEF 64.0%. (Counter-)Factual frames are generated with the 1SCM, conditioned on the ground-truth anatomy.
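
The interface in Figure 1 (inputs $x_i$, $t_i$, $I_c$, $\lambda_c$; output $x_{i+1}$) can be sketched as a single conditional denoising step. The sketch below is only illustrative and makes several assumptions: $t_i$ is treated as a noise level, the update is a deterministic Euler step of the kind used in elucidated diffusion samplers, and `toy_denoiser` is a hypothetical placeholder for the trained network, whose real architecture is not described in this section.

```python
from typing import Callable, List

Vector = List[float]  # flattened toy stand-in for a video or image tensor
Denoiser = Callable[[Vector, float, Vector, float], Vector]


def toy_denoiser(x: Vector, t: float, anatomy: Vector, lvef: float) -> Vector:
    """Hypothetical placeholder for the trained network.

    It ignores x and t and returns the anatomy scaled by an arbitrary
    function of the LVEF, only to show that both conditions affect the output.
    """
    scale = 1.0 - lvef / 100.0
    return [a * scale for a in anatomy]


def euler_step(x_i: Vector, t_i: float, t_next: float,
               anatomy: Vector, lvef: float, denoiser: Denoiser) -> Vector:
    """One Euler step from noise level t_i to t_next (assumed t_i > 0)."""
    denoised = denoiser(x_i, t_i, anatomy, lvef)
    # Slope of the probability-flow ODE: (x - D(x; t)) / t.
    slope = [(x - d) / t_i for x, d in zip(x_i, denoised)]
    # Move along the slope to obtain the slightly denoised sample x_{i+1}.
    return [x + (t_next - t_i) * s for x, s in zip(x_i, slope)]


# Toy values, for illustration only.
x_noisy = [1.0, 2.0]
anatomy_image = [0.5, 0.5]
x_next = euler_step(x_noisy, 2.0, 1.0, anatomy_image, 50.0, toy_denoiser)
print(x_next)
```

Repeating this step over a decreasing sequence of noise levels moves the sample from noise toward the denoiser's prediction; stepping directly to a noise level of 0 returns that prediction.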
