EchoLVFM: One-Step Video Generation via Latent Flow Matching for Echocardiogram Synthesis

Emmanuel Oladokun, Sarina Thomas, Jurica Šprem, Vicente Grau

Abstract

Echocardiography is widely used for assessing cardiac function, where clinically meaningful parameters such as left-ventricular ejection fraction (EF) play a central role in diagnosis and management. Generative models capable of synthesising realistic echocardiogram videos with explicit control over such parameters are valuable for data augmentation, counterfactual analysis, and specialist training. However, existing approaches typically rely on computationally expensive multi-step sampling and aggressive temporal normalisation, limiting efficiency and applicability to heterogeneous real-world data. We introduce EchoLVFM, a one-step latent video flow-matching framework for controllable echocardiogram generation. Operating in the latent space, EchoLVFM synthesises temporally coherent videos in a single inference step, achieving a $\mathbf{\sim 50\times}$ improvement in sampling efficiency compared to multi-step flow baselines while maintaining visual fidelity. The model supports global conditioning on clinical variables, demonstrated through precise control of EF, and enables reconstruction and counterfactual generation from partially observed sequences. A masked conditioning strategy further removes fixed-length constraints, allowing shorter sequences to be retained rather than discarded. We evaluate EchoLVFM on the CAMUS dataset under challenging single-frame conditioning. Quantitative and qualitative results demonstrate competitive video quality, strong EF adherence, and a 57.9% discrimination accuracy by expert clinicians, which is close to chance. These findings indicate that efficient, one-step flow matching can enable practical, controllable echocardiogram video synthesis without sacrificing fidelity. Code is available at: https://github.com/EngEmmanuel/EchoLVFM

Paper Structure (14 sections, 7 equations, 2 figures, 1 table)

Figures (2)

  • Figure 1: EchoLVFM. Training: Videos are noised and processed in latent space. All but one observed frame are zeroed to form $x_m$, which serves as conditioning. To support variable-length sequences, a padding vector $p$ indicates which frames are valid observations in the temporally augmented input. The target EF $\phi$ is provided as global conditioning. $r$ and $t$ denote timesteps with $r<t$, and the model $u_{\theta}$ learns to predict the conditional average velocity over the interval $[r,t]$. Inference: A partial video containing as few as one observed frame, together with a target EF and random noise, is passed to the trained model. One-step integration produces the generated video.
  • Figure 2: Qualitative Results. Columns 1-5 show frames sampled between and including ED and ES, while column 6 presents the M-mode slice of the middle row over time.
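
The conditioning inputs and the one-step update described in the Figure 1 caption can be illustrated with a minimal NumPy sketch. This is not the authors' implementation. The function names (`build_conditioning`, `one_step_sample`, `toy_velocity`), the tensor shapes, and the time convention (noise at $t=1$, data at $r=0$, update $z_r = z_t - (t-r)\,u_{\theta}$) are assumptions made for illustration, and `toy_velocity` is a hypothetical stand-in for the trained network $u_{\theta}$.

```python
import numpy as np

def build_conditioning(latents, observed_idx, max_len):
    """Build the masked conditioning x_m and padding vector p.

    latents: array of shape (T, C, H, W), a latent video with T <= max_len.
    Shapes and argument names are illustrative assumptions.
    """
    T = latents.shape[0]
    if not 0 <= observed_idx < T <= max_len:
        raise ValueError("observed_idx must index a valid frame and T <= max_len")
    # p marks which temporal positions hold valid frames (1) vs padding (0).
    p = np.zeros(max_len, dtype=np.float32)
    p[:T] = 1.0
    # x_m keeps only the single observed frame; every other frame is zeroed.
    x_m = np.zeros((max_len,) + latents.shape[1:], dtype=latents.dtype)
    x_m[observed_idx] = latents[observed_idx]
    return x_m, p

def one_step_sample(u_theta, noise, x_m, p, ef, r=0.0, t=1.0):
    """Single-step integration with an average-velocity model.

    Assumed convention: z_r = z_t - (t - r) * u_theta(z_t, r, t, ...).
    """
    u = u_theta(noise, r, t, x_m, p, ef)
    return noise - (t - r) * u

def toy_velocity(z_t, r, t, x_m, p, ef):
    # Hypothetical placeholder for the trained model: it ignores r, t, p and ef
    # and simply points from the masked conditioning towards the current state.
    return z_t - x_m

rng = np.random.default_rng(0)
latents = rng.standard_normal((5, 4, 8, 8)).astype(np.float32)  # 5 valid frames
x_m, p = build_conditioning(latents, observed_idx=0, max_len=8)
noise = rng.standard_normal(x_m.shape).astype(np.float32)
video = one_step_sample(toy_velocity, noise, x_m, p, ef=0.55)
```

With the placeholder velocity, the single step simply returns the conditioning tensor, which is enough to check shapes and the direction of the update. A trained $u_{\theta}$ would instead produce a full latent video, which would then be decoded back to pixel space.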