E2VIDiff: Perceptual Events-to-Video Reconstruction using Diffusion Priors
Jinxiu Liang, Bohan Yu, Yixin Yang, Yiming Han, Boxin Shi
TL;DR
E2VIDiff tackles the ill-posed task of reconstructing color video from achromatic events by casting it as conditional diffusion-based video generation. It combines pretrained video diffusion priors, a spatiotemporally factorized event encoder, an event-to-frame fusion module, and an event-guided sampling mechanism to keep reconstructions faithful to the input events while producing diverse, photorealistic frames. The approach is reported to achieve superior perceptual quality and temporal coherence, and its reconstructions support downstream tasks such as semantic segmentation and optical-flow estimation. The method performs strongly on multiple datasets, but the authors acknowledge limitations on extremely high-frame-rate motion, attributed to data scarcity and the cost of diffusion sampling, and point to broader use of diffusion-based video priors as future work.
Abstract
Event cameras, mimicking the human retina, capture brightness changes with unparalleled temporal resolution and dynamic range. Integrating events into intensities poses a highly ill-posed challenge, marred by initial condition ambiguities. Traditional regression-based deep learning methods fall short in perceptual quality, offering deterministic and often unrealistic reconstructions. In this paper, we introduce diffusion models to events-to-video reconstruction, achieving colorful, realistic, and perceptually superior video generation from achromatic events. Powered by the image generation ability and knowledge of pretrained diffusion models, the proposed method can achieve a better trade-off between the perception and distortion of the reconstructed frame compared to previous solutions. Extensive experiments on benchmark datasets demonstrate that our approach can produce diverse, realistic frames with faithfulness to the given events.
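The initial-condition ambiguity mentioned in the abstract can be illustrated with a small single-pixel sketch. It assumes the commonly used idealized event model, in which an event of polarity +1 or -1 fires each time the log intensity changes by a fixed contrast threshold; the threshold value of 0.2, the function name `integrate_events`, and the example intensities are hypothetical choices for illustration and are not taken from the paper. Integrating the same event stream from two different starting intensities yields trajectories that differ by a constant offset, so the events alone cannot determine the absolute intensity.

```python
import math

def integrate_events(log_i0, polarities, contrast_threshold=0.2):
    """Accumulate signed events at one pixel onto an assumed initial log intensity.

    Idealized model (an assumption here): each event changes the log
    intensity by exactly +/- contrast_threshold.
    """
    log_i = log_i0
    trajectory = [log_i]
    for p in polarities:
        log_i += p * contrast_threshold
        trajectory.append(log_i)
    return trajectory

# One event stream, two candidate initial intensities (both hypothetical).
events = [+1, +1, -1, +1]
dark_start = integrate_events(math.log(0.1), events)
bright_start = integrate_events(math.log(0.8), events)

# The two reconstructions differ by the same offset at every time step:
# the events constrain intensity changes, not the absolute intensity.
offsets = [b - d for d, b in zip(dark_start, bright_start)]
print(offsets)
```

Both trajectories are equally consistent with the events, which is one way to see why a learned prior over plausible frames is needed to pick a realistic reconstruction.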
