EvDiff: High Quality Video with an Event Camera

Weilun Li, Lei Sun, Ruixi Gao, Qi Jiang, Yuqin Ma, Kaiwei Wang, Ming-Hsuan Yang, Luc Van Gool, Danda Pani Paudel

TL;DR

EvDiff addresses the ill-posed problem of reconstructing high-quality color video from monochrome event streams by leveraging a one-step diffusion prior and a dedicated EvEncoder. A Surrogate Training Pipeline bridges scarce event–RGB data with large-scale image datasets through an E2VID-style Degradation Model and staged training (Stage 1 DiT, Stage 2 distillation, Stage 3 joint fine-tuning), with a fixed diffusion timestep $t^*$ guiding one-step refinement via $\hat{\boldsymbol{z}} = (\boldsymbol{z} - \beta_{t^*} \epsilon(\boldsymbol{z}; t^*))/\alpha_{t^*}$ and $\hat{\mathbf{I}} = \mathcal{D}(\hat{\boldsymbol{z}})$. Experiments on BS-ERGB and DSEC show EvDiff achieves state-of-the-art fidelity and perceptual realism (lower MSE/LPIPS, higher SSIM, and better FID/FVD) while producing chromatic outputs from monochrome events, and it outperforms ControlNet-based baselines while being more efficient. By combining diffusion priors with an efficient temporal encoder, EvDiff scales to large pretrained models and large image datasets, offering practical high-quality event-to-video synthesis with HDR-like richness.
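
The one-step refinement formula above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes the usual forward relation $\boldsymbol{z} = \alpha_{t^*}\boldsymbol{z}_0 + \beta_{t^*}\boldsymbol{\epsilon}$, which the formula implies but this summary does not state. The function name `one_step_refine`, the toy latent vectors, and the values of `alpha_t` and `beta_t` are all hypothetical, and the noise predictor is replaced by the true noise so that the inversion can be checked.

```python
import random


def one_step_refine(z, eps_pred, alpha_t, beta_t):
    """Element-wise z_hat = (z - beta_t * eps_pred) / alpha_t."""
    return [(zi - beta_t * ei) / alpha_t for zi, ei in zip(z, eps_pred)]


random.seed(0)

# Hypothetical schedule values at the fixed timestep t*.
alpha_t, beta_t = 0.8, 0.6

# Toy "clean" latent and noise; a real latent would be a tensor.
z_clean = [random.gauss(0.0, 1.0) for _ in range(4)]
noise = [random.gauss(0.0, 1.0) for _ in range(4)]

# Assumed forward relation: z = alpha_t * z_clean + beta_t * noise.
z = [alpha_t * c + beta_t * n for c, n in zip(z_clean, noise)]

# With an oracle noise prediction, one step recovers the clean latent.
z_hat = one_step_refine(z, noise, alpha_t, beta_t)
max_err = max(abs(a - b) for a, b in zip(z_hat, z_clean))
```

In the actual model the noise estimate would come from the trained network $\epsilon(\boldsymbol{z}; t^*)$, and the refined latent would then be decoded as $\hat{\mathbf{I}} = \mathcal{D}(\hat{\boldsymbol{z}})$; neither component is modelled here.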

Abstract

As neuromorphic sensors, event cameras asynchronously record changes in brightness as streams of sparse events with the advantages of high temporal resolution and high dynamic range. Reconstructing intensity images from events is a highly ill-posed task due to the inherent ambiguity of absolute brightness. Early methods generally follow an end-to-end regression paradigm, directly mapping events to intensity frames in a deterministic manner. While effective to some extent, these approaches often yield perceptually inferior results and struggle to scale up in model capacity and training data. In this work, we propose EvDiff, an event-based diffusion model that follows a surrogate training framework to produce high-quality videos. To reduce the heavy computational cost of high-frame-rate video generation, we design an event-based diffusion model that performs only a single forward diffusion step, equipped with a temporally consistent EvEncoder. Furthermore, our novel Surrogate Training Framework eliminates the dependence on paired event-image datasets, allowing the model to leverage large-scale image datasets for higher capacity. The proposed EvDiff is capable of generating high-quality colorful videos solely from monochromatic event streams. Experiments on real-world datasets demonstrate that our method strikes a sweet spot between fidelity and realism, outperforming existing approaches on both pixel-level and perceptual metrics.
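
The ambiguity of absolute brightness mentioned above can be seen in a minimal sketch of the standard idealized event model, in which a pixel emits an event whenever its log intensity moves by a fixed contrast threshold. The function name `simulate_events`, the threshold value, and the sample intensities are illustrative assumptions, not taken from the paper.

```python
import math


def simulate_events(intensities, threshold=0.2):
    """Idealized single-pixel event model.

    Emits (time_index, polarity) whenever the log intensity has moved
    by at least `threshold` from the last reference level.
    """
    events = []
    ref = math.log(intensities[0])
    for t, value in enumerate(intensities[1:], start=1):
        log_i = math.log(value)
        while abs(log_i - ref) >= threshold:
            polarity = 1 if log_i > ref else -1
            ref += polarity * threshold
            events.append((t, polarity))
    return events


# Two signals that differ only by a constant brightness factor.
dim = [1.0, 1.5, 2.5, 2.0, 0.9]
bright = [10.0 * v for v in dim]

events_dim = simulate_events(dim)
events_bright = simulate_events(bright)
```

Both signals produce the same event sequence, because a constant scale factor becomes a constant offset in log space and cancels in every difference. Events alone therefore cannot determine the absolute brightness level, which is why reconstruction needs a prior.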

Paper Structure

This paper contains 18 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Our EvDiff can reconstruct real high-quality video streams from monochrome event streams, while maintaining both fidelity and realism. Compared with Ground-Truth (GT), our result shows a higher dynamic range.
  • Figure 2: The proposed Surrogate Training Pipeline. Stage 1: We train a DiT model and Surrogate VAE Encoder with LQ-HQ pairs; Stage 2: The Surrogate VAE Encoder is distilled into EvEncoder; Stage 3: We finetune the whole EvDiff model.
  • Figure 3: Overview of the proposed E2VID-Style Degradation Model. HQ images sampled from the Places365 [places365] dataset are cropped to $512 \times 512$ and then processed by our degradation model to generate corresponding LQ images. For comparison, E2VID-style results from the BS-ERGB dataset are shown on the right.
  • Figure 4: Visual comparison on the BS-ERGB [tulyakov2022time] and DSEC [dsec] datasets. Our EvDiff produces higher-quality chromatic frames.
  • Figure 5: Visual comparison against multi-step ControlNet-based counterparts. Left four columns: Results from ControlNet-based methods. "GT Prompt": Prompts are produced from GT images with RAM [ram]. Our EvDiff produces more faithful results.