Training-Free Condition Video Diffusion Models for single frame Spatial-Semantic Echocardiogram Synthesis

Van Phi Nguyen; Tri Nhan Luong Ha; Huy Hieu Pham; Quoc Long Tran

TL;DR

This paper addresses data scarcity in echocardiography by proposing a training-free conditional video diffusion framework that generates realistic echocardiograms from a single end-diastolic segmentation map. The method, Free-Echo, builds on a 3D-U-Net denoiser and applies training-free conditioning via SDEdit: the reverse diffusion starts from a noisy pseudo-video $\hat{V}^K$ rather than from pure noise. The pseudo-video is built from a pseudo-image $\hat{I}_0$, which is obtained by solving an optimal transport (OT) problem that maps the labels of the segmentation map to the intensities of real video frames. The diffusion process uses $64$ steps, and the reverse process starts at $t_i=15$. Evaluations on CAMUS and EchoNet-Dynamic show that Free-Echo produces plausible, spatially aligned echocardiograms with performance approaching that of training-based conditional video diffusion models (CDMs) (e.g., about a $10\%$ drop in SSIM/PSNR/L2 and roughly a $20\%$ increase in FID/FVD), and that it outperforms SDEdit. The work highlights the potential for data augmentation and domain adaptation from single-segmentation inputs. It also notes limitations in resolution and temporal duration, as well as sensitivity to the diffusion starting point $t_i$, and points to future work on robustness and downstream validation.
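The SDEdit-style starting point described above can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes the standard DDPM forward process $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$, a linear beta schedule with placeholder endpoints, a pseudo-video formed by repeating the pseudo-image across frames, and zero-based step indexing. The function names `linear_alpha_bar` and `noisy_pseudo_video` are hypothetical.

```python
import numpy as np


def linear_alpha_bar(num_steps=64, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    The schedule endpoints are illustrative assumptions, not values
    taken from the paper.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def noisy_pseudo_video(pseudo_image, num_frames, t_start, alpha_bar, rng):
    """Repeat a pseudo-image over time and noise it to step t_start.

    Assumes the pseudo-video is a static copy of the pseudo-image in
    every frame; the denoiser is expected to add motion during the
    reverse process.
    """
    # Shape (num_frames, H, W): the same image in every frame.
    video = np.repeat(pseudo_image[None, ...], num_frames, axis=0)
    noise = rng.standard_normal(video.shape)
    a = alpha_bar[t_start]
    # Closed-form forward diffusion from step 0 to step t_start.
    return np.sqrt(a) * video + np.sqrt(1.0 - a) * noise


alpha_bar = linear_alpha_bar(num_steps=64)
rng = np.random.default_rng(0)
pseudo_image = np.zeros((112, 112))  # stand-in for the OT-mapped image
v_start = noisy_pseudo_video(pseudo_image, num_frames=16, t_start=15,
                             alpha_bar=alpha_bar, rng=rng)
```

A larger `t_start` injects more noise, which gives the denoiser more freedom but weakens the spatial alignment with the segmentation map. This trade-off is one way to read the reported sensitivity to $t_i$.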

Abstract

Conditional video diffusion models (CDMs) have shown promising results for video synthesis, potentially enabling the generation of realistic echocardiograms to address the problem of data scarcity. However, current CDMs require a paired dataset of segmentation maps and echocardiograms. We present a new method called Free-Echo for generating realistic echocardiograms from a single end-diastolic segmentation map without additional training data. Our method is based on the 3D-Unet with Temporal Attention Layers model and is conditioned on the segmentation map using a training-free conditioning method based on SDEdit. We evaluate our model on two public echocardiogram datasets, CAMUS and EchoNet-Dynamic. We show that our model can generate plausible echocardiograms that are spatially aligned with the input segmentation map, achieving performance comparable to training-based CDMs. Our work opens up new possibilities for generating echocardiograms from a single segmentation map, which can be used for data augmentation, domain adaptation, and other applications in medical imaging. Our code is available at https://github.com/gungui98/echo-free.

Paper Structure (5 sections, 4 equations, 2 figures, 1 table)

Introduction
Method
Experiment Settings
Conclusion
Acknowledgments

Figures (2)

Figure 1: Illustration of our method. Given a single end-diastolic segmentation map $m_0$, we first solve the optimal transport problem to obtain the pseudo-image $\hat{I}_0$. We then start the reverse process of the diffusion model from the diffusion step $t$ with the noisy version of the pseudo-video $\hat{V}^K$, obtained by adding Gaussian noise to the pseudo-image. The reverse process continues until the diffusion step $t = 0$, at which point we obtain the generated echocardiogram $\hat{x}^K$.
Figure 2: Visual comparison of echocardiograms generated by our model and by the CDM with the ground-truth echocardiograms, for models trained on the CAMUS (blue) and EchoNet-Dynamic (green) datasets.
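
The first step in Figure 1, mapping segmentation labels to pixel intensities by optimal transport, can be illustrated with a small sketch. The exact OT formulation used by Free-Echo is not given here, so the following is only one plausible reading: it treats both images as one-dimensional distributions of values, where the optimal transport plan for a convex cost is the monotone (rank-matching) map. The function name `ot_match_1d` is hypothetical.

```python
import numpy as np


def ot_match_1d(source, reference):
    """Map source values onto the reference intensity distribution.

    In one dimension, the optimal transport plan for a convex cost
    pairs values by rank, so this is quantile (histogram) matching.
    This is an assumed stand-in for the paper's OT step, not the
    authors' formulation.
    """
    src = source.ravel().astype(float)
    ref = np.sort(reference.ravel().astype(float))
    # Rank of each source pixel; a stable sort breaks ties by pixel order.
    order = np.argsort(src, kind="stable")
    # Spread source ranks evenly over the sorted reference values.
    idx = np.linspace(0, ref.size - 1, src.size).round().astype(int)
    out = np.empty_like(src)
    out[order] = ref[idx]
    return out.reshape(source.shape)


labels = np.array([[0, 0], [1, 2]])          # toy segmentation map
frame = np.array([[10, 40], [30, 20]])       # toy real frame
pseudo = ot_match_1d(labels, frame)
```

Because a segmentation map contains many tied label values, the rank matching inside each region is arbitrary in this sketch. A practical version would likely need a more careful tie-breaking or a spatially aware cost.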
