DESSERT: Diffusion-based Event-driven Single-frame Synthesis via Residual Training

Jiyun Kong, Jun-Hyuk Kim, Jong-Seok Lee

TL;DR

DESSERT tackles the challenge of predicting future video frames by leveraging asynchronous event data and a pre-trained diffusion prior. It introduces a two-stage training pipeline: an ER-VAE that aligns event representations with inter-frame residual latents, and an event-conditioned diffusion model that denoises these residual latents to synthesize the next frame, guided by both image and event signals. The Diverse-Length Temporal augmentation further improves robustness to varying motion scales. Empirical results on real and synthetic datasets demonstrate sharper, more temporally coherent frame synthesis with state-of-the-art quantitative scores and favorable qualitative comparisons, albeit with higher inference cost due to diffusion steps. The work highlights the potential of residual-centric diffusion conditioning for event-based video synthesis and points to future directions in efficiency and multi-frame generation.

Abstract

Video frame prediction extrapolates future frames from previous frames, but suffers from prediction errors in dynamic scenes due to the lack of information about the next frame. Event cameras address this limitation by capturing per-pixel brightness changes asynchronously with high temporal resolution. Prior research on event-based video frame prediction has leveraged motion information from event data, often by predicting event-based optical flow and reconstructing frames via pixel warping. However, such approaches introduce holes and blurring when pixel displacement is inaccurate. To overcome this limitation, we propose DESSERT, a diffusion-based event-driven single-frame synthesis framework via residual training. Leveraging a pre-trained Stable Diffusion model, our method is trained on inter-frame residuals to ensure temporal consistency. The training pipeline consists of two stages: (1) an Event-to-Residual Alignment Variational Autoencoder (ER-VAE) that aligns the event frame between anchor and target frames with the corresponding residual, and (2) a diffusion model that denoises the residual latent conditioned on event data. Furthermore, we introduce Diverse-Length Temporal (DLT) augmentation, which improves robustness by training on frame segments of varying temporal lengths. Experimental results demonstrate that our method outperforms existing event-based reconstruction, image-based video frame prediction, event-based video frame prediction, and one-sided event-based video frame interpolation methods, producing sharper and more temporally consistent frame synthesis.
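The abstract describes DLT augmentation only as training on frame segments of varying temporal lengths, with the event frame taken between the anchor and target frames. A minimal sketch of that idea follows. The uniform sampling of the segment length, the `(t, x, y, polarity)` event layout, the signed-polarity accumulation, and the names `sample_dlt_pair`, `accumulate_events`, and `max_len` are all illustrative assumptions, not details taken from the paper.

```python
import numpy as np


def sample_dlt_pair(num_frames, max_len, rng):
    """Pick (anchor, target) frame indices whose gap varies from 1 to max_len.

    Hypothetical helper: the sampling distribution is an assumption.
    """
    # Cap the gap so that the target index stays inside the sequence.
    longest = min(max_len, num_frames - 1)
    length = int(rng.integers(1, longest + 1))          # gap in [1, longest]
    anchor = int(rng.integers(0, num_frames - length))  # upper bound is exclusive
    return anchor, anchor + length


def accumulate_events(events, t_start, t_end, height, width):
    """Sum signed polarities of events with t_start <= t < t_end into an (H, W) frame.

    `events` is an (N, 4) float array with columns (t, x, y, polarity in {-1, +1}).
    This is one common event representation, assumed here for illustration.
    """
    t = events[:, 0]
    keep = (t >= t_start) & (t < t_end)
    xs = events[keep, 1].astype(np.int64)
    ys = events[keep, 2].astype(np.int64)
    frame = np.zeros((height, width), dtype=np.float64)
    # np.add.at accumulates correctly when several events hit the same pixel.
    np.add.at(frame, (ys, xs), events[keep, 3])
    return frame


# Toy usage: 10 frames at 0.1 s spacing and a handful of synthetic events.
rng = np.random.default_rng(0)
frame_times = np.arange(10) * 0.1
toy_events = np.array([
    [0.05, 0, 0, 1.0],
    [0.15, 0, 0, 1.0],
    [0.35, 1, 1, -1.0],
    [0.75, 1, 0, 1.0],
])
anchor, target = sample_dlt_pair(len(frame_times), max_len=4, rng=rng)
event_frame = accumulate_events(
    toy_events, frame_times[anchor], frame_times[target], height=2, width=2
)
```

A longer gap between anchor and target collects more events and corresponds to larger motion, which is presumably what exposes the model to varying motion scales during training.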

Paper Structure

This paper contains 25 sections, 2 equations, 18 figures, 10 tables.

Figures (18)

  • Figure 1:
  • Figure 2: Training pipeline of DESSERT. The framework consists of two stages: In stage 1, ER-VAE is optimized with the loss $\mathcal{L}_\text{E2R}$, which jointly enforces latent alignment between the event latent $z_\text{event}$ and the residual latent $z_\text{res} = z_{t+1} - z_t$. In stage 2, a diffusion model is trained with the denoising objective $\mathcal{L}_\text{DM}$, which predicts the denoised residual latent $\hat{z}_\text{res}$ under event-based conditioning.
  • Figure 3: Structural relationship between event and residual representations. The decoded residual latent (d) is derived from the anchor frame (a) and target frame (b) as the difference between them. The event frame (c) reveals that event data captures inter-frame motion consistent with latent residuals.
  • Figure 4: Qualitative comparison on BS-ERGB [tulyakov2022timelenspp]. Our method produces clearer motion with noticeably reduced ghosting artifacts compared to VFPSIE [zhu2024video] (event-based video frame prediction) and CBMNet-Large [kim2023cbmnet] (event-based video frame interpolation, one-sided prediction), resulting in more precise action depiction. Although CBMNet-Large is the second-best model on the BS-ERGB (1-frame prediction) benchmark, subtle motion blur remains visible in challenging regions.
  • Figure 5: Qualitative comparison on GoPro [nah2017deep]. We compare our method against VFPSIE [zhu2024video] (event-based video frame prediction), RE-VDM [chen2025repurposing] (event-based video frame interpolation, one-sided prediction), and bFlow [gehrig2024dense] (flow estimation). RE-VDM is the second-best model on the GoPro 7-frame prediction benchmark. The figure shows the 1st, 3rd, and 5th predicted frames in the 7-frame generation sequence. DESSERT preserves stronger video-level temporal consistency, maintaining sharper text details and more stable object positioning across consecutive predictions.
  • ...and 13 more figures
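
The caption of Figure 2 defines the residual latent as $z_\text{res} = z_{t+1} - z_t$. The sketch below illustrates only that arithmetic, using random NumPy arrays as stand-ins for the VAE latents. The function names and the latent shape are hypothetical, and the reconstruction step $\hat{z}_{t+1} = z_t + \hat{z}_\text{res}$ is an assumption that follows from rearranging the definition, not a confirmed description of the inference procedure.

```python
import numpy as np


def residual_latent(z_anchor, z_target):
    """Training target from the Figure 2 caption: z_res = z_{t+1} - z_t."""
    return z_target - z_anchor


def reconstruct_target_latent(z_anchor, z_res_hat):
    """Assumed inverse of the definition: add the predicted residual to the anchor latent."""
    return z_anchor + z_res_hat


# Stand-in latents; the shape (4, 8, 8) is illustrative only.
rng = np.random.default_rng(0)
z_t = rng.standard_normal((4, 8, 8))
z_t1 = z_t + 0.1 * rng.standard_normal((4, 8, 8))  # small inter-frame change

z_res = residual_latent(z_t, z_t1)
z_t1_hat = reconstruct_target_latent(z_t, z_res)    # exact when the residual is exact
```

In this formulation a static scene gives a residual of zero, so the model only has to account for what changed between the anchor and target frames.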