Training-Free Condition Video Diffusion Models for single frame Spatial-Semantic Echocardiogram Synthesis
Van Phi Nguyen, Tri Nhan Luong Ha, Huy Hieu Pham, Quoc Long Tran
TL;DR
This paper tackles data scarcity in echocardiogram synthesis by proposing a training-free conditional video diffusion framework that generates realistic echocardiograms from a single end-diastolic segmentation map. The method, Free-Echo, builds on a 3D U-Net denoiser and applies training-free conditioning via SDEdit: the reverse diffusion starts from a noisy pseudo-video $\hat{V}^K$ rather than from pure noise. The pseudo-video is built from an image $\hat{I}_0$, a label-to-intensity mapping of the segmentation map obtained by solving an optimal-transport (OT) problem against real video frames. The reverse process iterates over $64$ diffusion steps starting at $t_i=15$. Evaluations on CAMUS and EchoNet-Dynamic show that Free-Echo produces plausible, spatially aligned echocardiograms with performance approaching training-based conditional video diffusion models (CDMs) (e.g., about a $10\%$ drop in SSIM/PSNR/L2 and roughly a $20\%$ increase in FID/FVD), and that it outperforms SDEdit. The work highlights the potential of single-segmentation inputs for data augmentation and domain adaptation. It also notes limitations in resolution and temporal duration, as well as sensitivity to the diffusion starting point $t_i$, and points to future work on robustness and downstream validation.
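The SDEdit-style starting point described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a linear beta schedule, a pseudo-video formed by repeating a single image across frames, 0-based step indexing, and a random array standing in for $\hat{I}_0$. The function names (`linear_alpha_bar`, `make_pseudo_video`, `noise_to_step`) and the array sizes are hypothetical. Only the closed-form forward noising $x_t = \sqrt{\bar{\alpha}_t}\,x_0 + \sqrt{1-\bar{\alpha}_t}\,\epsilon$ is standard DDPM.

```python
import numpy as np


def linear_alpha_bar(num_steps: int, beta_start: float = 1e-4,
                     beta_end: float = 2e-2) -> np.ndarray:
    """Cumulative product of (1 - beta_t) for a linear beta schedule.

    The schedule shape and its endpoints are assumptions for illustration.
    """
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)


def make_pseudo_video(frame: np.ndarray, num_frames: int) -> np.ndarray:
    """Repeat one 2D frame along a new time axis: (H, W) -> (T, H, W).

    Assumption: the pseudo-video is a static repetition of the mapped image.
    """
    return np.repeat(frame[None, ...], num_frames, axis=0)


def noise_to_step(video: np.ndarray, t_i: int, alpha_bar: np.ndarray,
                  rng: np.random.Generator) -> np.ndarray:
    """Sample x_t from the DDPM forward marginal q(x_t | x_0) at step t_i."""
    a = alpha_bar[t_i]
    eps = rng.standard_normal(video.shape)
    return np.sqrt(a) * video + np.sqrt(1.0 - a) * eps


rng = np.random.default_rng(0)
alpha_bar = linear_alpha_bar(num_steps=64)        # 64 steps, as in the text
frame = rng.uniform(-1.0, 1.0, size=(8, 8))       # placeholder for \hat{I}_0
pseudo_video = make_pseudo_video(frame, num_frames=16)
noisy_video = noise_to_step(pseudo_video, t_i=15, alpha_bar=alpha_bar, rng=rng)
# A trained video denoiser would now run the reverse process from step t_i
# on noisy_video instead of starting from pure Gaussian noise.
```

In this sketch a smaller `t_i` keeps the result closer to the pseudo-video, while a larger `t_i` injects more noise and leaves more freedom to the denoiser. That trade-off is one plausible reading of the sensitivity to $t_i$ mentioned above.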
Abstract
Conditional video diffusion models (CDMs) have shown promising results for video synthesis, potentially enabling the generation of realistic echocardiograms to address the problem of data scarcity. However, current CDMs require a dataset of paired segmentation maps and echocardiograms. We present a new method called Free-Echo for generating realistic echocardiograms from a single end-diastolic segmentation map without additional training data. Our method is based on the 3D U-Net with Temporal Attention Layers model and is conditioned on the segmentation map using a training-free conditioning method based on SDEdit. We evaluate our model on two public echocardiogram datasets, CAMUS and EchoNet-Dynamic. We show that our model can generate plausible echocardiograms that are spatially aligned with the input segmentation map, achieving performance comparable to training-based CDMs. Our work opens up new possibilities for generating echocardiograms from a single segmentation map, which can be used for data augmentation, domain adaptation, and other applications in medical imaging. Our code is available at \url{https://github.com/gungui98/echo-free}.
