E2VIDiff: Perceptual Events-to-Video Reconstruction using Diffusion Priors

Jinxiu Liang; Bohan Yu; Yixin Yang; Yiming Han; Boxin Shi

TL;DR

E2VIDiff tackles the ill-posed task of reconstructing color video from achromatic events by casting it as conditional diffusion-based video generation. It combines pretrained video diffusion priors, a spatiotemporally factorized event encoder, an event-to-frame fusion module, and an event-guided sampling mechanism that keeps the output faithful to the input events while producing diverse, photorealistic frames. The approach achieves superior perceptual quality and temporal coherence, and its reconstructions can be fed directly to downstream tasks such as semantic segmentation and optical-flow estimation. The authors report strong performance on multiple datasets but acknowledge limitations under extremely high-frame-rate motion, which they attribute to data scarcity and the cost of diffusion sampling, and they point to broader diffusion-based video priors as future work.

Abstract

Event cameras, mimicking the human retina, capture brightness changes with unparalleled temporal resolution and dynamic range. Integrating events into intensities poses a highly ill-posed challenge, marred by initial condition ambiguities. Traditional regression-based deep learning methods fall short in perceptual quality, offering deterministic and often unrealistic reconstructions. In this paper, we introduce diffusion models to events-to-video reconstruction, achieving colorful, realistic, and perceptually superior video generation from achromatic events. Powered by the image generation ability and knowledge of pretrained diffusion models, the proposed method can achieve a better trade-off between the perception and distortion of the reconstructed frame compared to previous solutions. Extensive experiments on benchmark datasets demonstrate that our approach can produce diverse, realistic frames with faithfulness to the given events.
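
The initial-condition ambiguity mentioned in the abstract can be illustrated with a minimal sketch. It assumes the commonly used event generation model, in which each event marks a change of one contrast threshold in log intensity at a pixel. The function name `integrate_events`, the threshold value, and the two starting intensities are hypothetical and chosen only for illustration; none of them come from the paper.

```python
import math

def integrate_events(polarities, contrast_threshold, initial_log_intensity):
    """Accumulate signed events at one pixel into a log-intensity track.

    Assumes the idealized model: every event shifts the log intensity by
    +/- contrast_threshold. Noise and refractory effects are ignored.
    """
    log_intensity = initial_log_intensity
    track = [log_intensity]
    for polarity in polarities:
        log_intensity += contrast_threshold * polarity
        track.append(log_intensity)
    return track

# One event stream (hypothetical polarities) observed at a single pixel.
polarities = [+1, +1, -1, +1, -1, -1]
C = 0.2  # hypothetical contrast threshold

# The events never reveal the starting intensity, so both tracks below
# are equally consistent with the same input.
dark = integrate_events(polarities, C, math.log(0.1))
bright = integrate_events(polarities, C, math.log(0.8))

# The two reconstructions differ by a constant offset at every time step.
offsets = [b - d for d, b in zip(dark, bright)]
```

Both tracks explain the events equally well, so a reconstruction method has to supply the missing offset (and, for color video, the missing chromaticity) from some prior, which is the role the abstract assigns to the pretrained diffusion model.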

Paper Structure (24 sections, 14 equations, 10 figures, 2 tables, 1 algorithm)

This paper contains 24 sections, 14 equations, 10 figures, 2 tables, 1 algorithm.

Introduction
Related Work
Event-to-video reconstruction.
Conditional diffusion models.
Method
Problem definition
Events-to-video diffusion models
Event-guided sampling
Network architecture
Experiments
Experimental setup
Dataset.
Metrics.
Baselines.
Implementation details.
...and 9 more sections

Figures (10)

Figure 1: Visual comparison results of state-of-the-art events-to-video methods E2VID+ stoffregen2020reducinga, ETNet weng2021eventbased and the proposed E2VIDiff. Given achromatic events, our approach generates high-quality chromatic videos that are perceptually accurate and aligned with the input events, offering photorealism, vivid coloration, and diversity. Please refer to the supplementary video for video reconstruction comparisons.
Figure 2: Schematic of the proposed E2VIDiff. (a) The process begins with the latent frame representation $x_0$ derived from frame $L$, which undergoes a forward diffusion via the transition kernel expressed in Eq. \ref{eq:forward}. Conversely, backward diffusion begins from pure Gaussian noise $x_T\sim\mathcal{N}(0,I)$ and iteratively reconstructs a predicted sample $x_0\sim p(x|c)$ by a backward diffusion expressed in Eqs. \ref{eq:ddim} and \ref{eq:x0}. (b) The proposed event-guided sampling mechanism iteratively enhances the fidelity and perceptual quality of the generated frame by alternating between denoising (to align with natural image characteristics) and ensuring the predicted sample $x_{0|\tau}$ is consistent with the physical formulation of the given events.
Figure 3: The network architecture of the denoising U-Net $\epsilon_\theta$ in the proposed E2VIDiff. (a) Overview of the denoising process utilizing the denoising U-Net $\epsilon_\theta$, featuring dual branches with weights initialized from pretrained image diffusion models. (b) The event extractor $\mathcal{E}_\text{E}$ is designed to extract conditioning information from the event stream. (c) The event-to-frame integration module $\mathcal{F}_\text{EFI}$ is designed to merge event features with the output of the frozen image diffusion models. (d) The temporal modulation module $\mathcal{F}_\text{TM}$ ensures temporal consistency in the generated frames.
Figure 4: Qualitative comparisons on real events captured by DAVIS240 mueggler2017eventcamera (top) and DAVIS346B zhu2018multia (bottom) event cameras, with corresponding achromatic frames captured concurrently as references. The compared state-of-the-art methods include: E2VID rebecq2019eventstovideo, FireNet scheerlinck2020fast, FireNet+ stoffregen2020reducinga, E2VID+ stoffregen2020reducinga, SSL-E2VID paredes-valles2021back, EV-SNN zhu2022eventbased, and ETNet weng2021eventbased.
Figure 5: Qualitative comparisons on real events captured by Prophesee Gen3 gehrig2021dsec event cameras, accompanied by chromatic frames captured in stereo alongside the events, serving as unaligned references. Please refer to the caption of Fig. \ref{fig:fig_sota_ijrr2_mvsec1} for a complete list of the compared methods.
...and 5 more figures
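
The forward and backward diffusion described in the Figure 2 caption can be sketched as follows. The equations it refers to are not reproduced on this page, so the sketch uses the standard DDPM forward kernel and the deterministic DDIM update as stand-ins; it is not the authors' implementation. The linear noise schedule, the `oracle_eps` stand-in for the denoising U-Net, and the optional `guide` hook (a placeholder for an event-consistency correction of the predicted sample) are all assumptions made for illustration.

```python
import math
import random

def make_alpha_bars(num_steps, beta_start=1e-4, beta_end=2e-2):
    """Cumulative products of (1 - beta_t) for a linear schedule (assumed)."""
    alpha_bars, prod = [], 1.0
    for t in range(num_steps):
        beta = beta_start + (beta_end - beta_start) * t / (num_steps - 1)
        prod *= 1.0 - beta
        alpha_bars.append(prod)
    return alpha_bars

def forward_diffuse(x0, alpha_bar, noise):
    """Standard forward kernel: x_t = sqrt(ab) * x0 + sqrt(1 - ab) * noise."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [a * x + s * n for x, n in zip(x0, noise)]

def predict_x0(x_t, eps_hat, alpha_bar):
    """Clean-sample estimate implied by a noise prediction."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - s * e) / a for x, e in zip(x_t, eps_hat)]

def ddim_step(x_t, eps_hat, alpha_bar_t, alpha_bar_prev, guide=None):
    """One deterministic DDIM update with an optional correction of x0_hat."""
    x0_hat = predict_x0(x_t, eps_hat, alpha_bar_t)
    if guide is not None:
        # Hypothetical hook: adjust the predicted sample, then recompute the
        # noise so that x_t stays consistent with the adjusted estimate.
        x0_hat = guide(x0_hat)
        a_t, s_t = math.sqrt(alpha_bar_t), math.sqrt(1.0 - alpha_bar_t)
        eps_hat = [(x - a_t * h) / s_t for x, h in zip(x_t, x0_hat)]
    a, s = math.sqrt(alpha_bar_prev), math.sqrt(1.0 - alpha_bar_prev)
    return [a * h + s * e for h, e in zip(x0_hat, eps_hat)]

# Toy demonstration with a 3-element "latent" and an oracle noise predictor.
random.seed(0)
x0 = [0.5, -0.3, 0.8]
alpha_bars = make_alpha_bars(50)

def oracle_eps(x_t, alpha_bar):
    """Stand-in for the denoiser: returns the exact noise given the true x0."""
    a, s = math.sqrt(alpha_bar), math.sqrt(1.0 - alpha_bar)
    return [(x - a * c) / s for x, c in zip(x_t, x0)]

noise = [random.gauss(0.0, 1.0) for _ in x0]
x = forward_diffuse(x0, alpha_bars[-1], noise)  # noised sample at the last step
for t in range(len(alpha_bars) - 1, -1, -1):
    alpha_bar_prev = alpha_bars[t - 1] if t > 0 else 1.0
    x = ddim_step(x, oracle_eps(x, alpha_bars[t]), alpha_bars[t], alpha_bar_prev)
sample = x
```

With the oracle predictor the loop returns the original `x0` up to floating-point error, which only checks that the update equations are mutually consistent. In a real sampler the noise prediction would come from a trained network conditioned on the events, and the `guide` hook marks where a correction of the predicted sample toward the event measurements could be applied between denoising steps.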
