EvDiff: High Quality Video with an Event Camera

Weilun Li; Lei Sun; Ruixi Gao; Qi Jiang; Yuqin Ma; Kaiwei Wang; Ming-Hsuan Yang; Luc Van Gool; Danda Pani Paudel

TL;DR

EvDiff addresses the ill-posed problem of reconstructing high-quality color video from monochrome event streams by combining a one-step diffusion prior with a dedicated EvEncoder. A Surrogate Training Pipeline bridges scarce event–RGB data and large-scale image datasets through an E2VID-style Degradation Model and three training stages (Stage 1: DiT; Stage 2: distillation; Stage 3: joint fine-tuning). A fixed diffusion timestep $t^*$ guides the one-step refinement $\hat{\boldsymbol{z}} = (\boldsymbol{z} - \beta_{t^*} \epsilon(\boldsymbol{z}; t^*))/\alpha_{t^*}$, followed by $\hat{\mathbf{I}} = \mathcal{D}(\hat{\boldsymbol{z}})$. Experiments on BS-ERGB and DSEC show that EvDiff achieves state-of-the-art fidelity and perceptual realism (lower MSE and LPIPS, higher SSIM, and better FID and FVD) while producing chromatic outputs from monochrome events; it also outperforms ControlNet-based baselines while being more efficient. By combining diffusion priors with an efficient temporal encoder, EvDiff scales to large pretrained models and large image datasets, offering practical high-quality event-to-video synthesis with HDR-like richness.
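The one-step refinement formula above can be illustrated with a small numerical sketch. This is not the authors' implementation: the function name `one_step_refine`, the oracle noise predictor, and the scalar values chosen for $\alpha_{t^*}$, $\beta_{t^*}$ and $t^*$ are all hypothetical. The sketch also assumes the usual diffusion parameterization $\boldsymbol{z} = \alpha_{t^*} \boldsymbol{z}_0 + \beta_{t^*} \boldsymbol{\epsilon}$, under which a perfect noise prediction recovers the clean latent in a single step. The decoder $\mathcal{D}$ is omitted.

```python
import numpy as np

def one_step_refine(z, eps_fn, alpha_t, beta_t, t_star):
    """Estimate the clean latent from z in one step at a fixed timestep.

    Implements z_hat = (z - beta_t * eps(z; t_star)) / alpha_t.
    eps_fn is a stand-in for the noise-prediction network (hypothetical).
    """
    return (z - beta_t * eps_fn(z, t_star)) / alpha_t

# Toy setup (all values are illustrative assumptions, not from the paper).
rng = np.random.default_rng(0)
z0 = rng.standard_normal((4, 8, 8))      # "clean" latent
noise = rng.standard_normal(z0.shape)    # noise mixed into the latent
alpha_t, beta_t, t_star = 0.8, 0.6, 200  # hypothetical schedule values

# Assumed forward parameterization: z = alpha * z0 + beta * noise.
z = alpha_t * z0 + beta_t * noise

def oracle_eps(z_in, t):
    """Oracle predictor that returns the true noise, for checking the algebra."""
    return noise

z_hat = one_step_refine(z, oracle_eps, alpha_t, beta_t, t_star)
```

With an oracle predictor, `z_hat` matches `z0` up to floating-point error. A learned predictor would give only an approximation, and its quality would determine the quality of the decoded frame.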

Abstract

As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.
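The brightness ambiguity mentioned in the abstract can be made concrete with a toy event simulator. The sketch below assumes the common idealized model in which a pixel emits one signed event each time its log-intensity moves by a fixed contrast threshold. That model, the function name `simulate_events`, and the example values are assumptions for illustration, not details taken from the paper.

```python
import numpy as np

def simulate_events(log_frames, threshold=0.5):
    """Return signed per-pixel event counts between consecutive frames.

    Idealized model (an assumption): each pixel fires one event per
    `threshold` of log-intensity change relative to its reference level.
    """
    log_frames = np.asarray(log_frames, dtype=np.float64)
    ref = log_frames[0].copy()  # per-pixel reference log-intensity
    counts_per_step = []
    for frame in log_frames[1:]:
        # Number of whole threshold crossings, signed by polarity.
        counts = np.trunc((frame - ref) / threshold).astype(np.int64)
        ref = ref + counts * threshold  # move the reference by the fired events
        counts_per_step.append(counts)
    return np.stack(counts_per_step)

# Three tiny 2x2 log-intensity frames (illustrative values).
log_frames = np.array([
    [[0.0, 1.0], [2.0, 0.5]],
    [[0.75, 1.0], [1.0, 2.0]],
    [[1.5, 0.25], [1.0, 2.0]],
])

events_dark = simulate_events(log_frames)
# A constant log offset is a global brightness scaling of the scene.
events_bright = simulate_events(log_frames + 1.0)
ambiguous = np.array_equal(events_dark, events_bright)
```

Both scenes produce identical event counts, so under this model the events alone cannot identify the absolute brightness level. A reconstruction method has to supply it from a prior.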
