UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models

Gang Xu, Zhiyu Zhu, Junhui Hou

TL;DR

This paper establishes a baseline model by directly applying event data as a condition to synthesize videos and introduces the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction, thereby creating a unified event-to-frame reconstruction framework.

Abstract

Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, as they fundamentally record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we further introduce the event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. We also refer the reviewers to the video demo contained in the supplementary material for video results. The code will be publicly available at https://github.com/CS-GangXu/UniE2F.
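The "physical correlation between the event stream and video frames" mentioned above can be illustrated with the standard idealized event-generation model, in which each event marks a log-intensity change of one contrast threshold at a pixel. The sketch below is not the paper's learned residual predictor. The function names, the `(x, y, polarity)` event layout, and the threshold value `0.2` are assumptions made for illustration.

```python
from typing import List, Tuple

# Hypothetical event layout: (x, y, polarity), with polarity in {+1, -1}.
Event = Tuple[int, int, int]


def accumulate_events(events: List[Event], height: int, width: int) -> List[List[int]]:
    """Sum event polarities per pixel over the interval between two frames."""
    acc = [[0] * width for _ in range(height)]
    for x, y, polarity in events:
        acc[y][x] += polarity
    return acc


def log_intensity_residual(
    events: List[Event],
    height: int,
    width: int,
    contrast_threshold: float = 0.2,  # assumed value; sensor dependent in practice
) -> List[List[float]]:
    """Approximate log I(t1) - log I(t0) as threshold * (sum of polarities).

    This is the idealized, noise-free model; real sensors deviate from it.
    """
    acc = accumulate_events(events, height, width)
    return [[contrast_threshold * v for v in row] for row in acc]


if __name__ == "__main__":
    demo_events = [(0, 0, 1), (0, 0, 1), (1, 0, -1)]
    print(log_intensity_residual(demo_events, height=1, width=2))
```

Under this model, pixels that fire no events have a zero residual, which is why static texture cannot be recovered from events alone and a generative prior is needed.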

Paper Structure (20 sections, 1 theorem, 28 equations, 18 figures, 13 tables, 1 algorithm)

This paper contains 20 sections, 1 theorem, 28 equations, 18 figures, 13 tables, 1 algorithm.

Key Result

Proposition 1

Assume that the gradient term $\nabla_{\mathbf{U}^t} \mathcal{L}_{\text{residual}}(\mathbf{U}^t)$ derived from the inter-frame residual guidance lies in the tangent space $T_{\mathbf{U}^t}\mathcal{M}$ of the data manifold $\mathcal{M}$ learned by the diffusion model. Then the following characteristics hold:

Figures (18)

  • Figure 1: Illustration of the forward and backward diffusion processes for our UniE2F conditioned on event data. The right and left parts indicate the inputs and results of our algorithm, while in the central plot, solid and dashed lines of the same color represent the reverse-time sampling SDE and ODE trajectories under the same setting, respectively. The proposed method adapts to different types of event-assisted frame reconstruction tasks: (A) event-based frame reconstruction, which takes only events as input and reconstructs RGB frames; (B) frame prediction, which takes events and the first frame as input and reconstructs the remaining frames; and (C) frame interpolation, which takes events and the first and last frames as input and reconstructs the intermediate frames. (D) denotes the ground-truth frames corresponding to the conditional event data.
  • Figure 2: Schematic of the proposed framework, which integrates event-based inter-frame residual guidance during the inference stage. At step $t$ ($t \leq \tau$), given event representations, we use a ResNet to predict the inter-frame residuals between consecutive frames. These residuals are then used to formulate the inter-frame residual loss $\mathcal{L}_\text{residual}$, which is minimized by gradient descent to update the noisy latent.
  • Figure 3: Visual comparison of event-based video frame reconstruction results on synthetic (1st and 2nd rows) and real-world (3rd and 4th rows) datasets. The event data is visualized as a polarity map.
  • Figure 4: Visual comparison on synthetic dataset: UniE2F is shown in both video frame interpolation (VFI) and video frame prediction (VFP) modes, while CBMNet is shown under the VFI setting. Note that the frames highlighted with purple borders denote the given frames, whereas frames highlighted with blue borders denote the predicted frames.
  • Figure 5: Qualitative comparison on HQF (Stoffregen et al., 2020), IJRR (Mueggler et al., 2017), and MVSEC (Zhu et al., 2018).
  • ...and 13 more figures
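
The guidance step described in the Figure 2 caption (predict inter-frame residuals, form $\mathcal{L}_\text{residual}$, update the latent by gradient descent) can be sketched as follows. This is a minimal illustration, not the authors' implementation: it assumes a squared-error loss between consecutive-frame differences and the predicted residuals, applies the update directly to a toy frame sequence rather than to a diffusion latent, and uses a hypothetical step size. All function names are invented for the example.

```python
from typing import List

Frame = List[float]  # a flattened frame; real frames or latents would be tensors


def residual_loss(frames: List[Frame], residuals: List[Frame]) -> float:
    """Assumed loss: sum over i of ||(frames[i+1] - frames[i]) - residuals[i]||^2."""
    total = 0.0
    for i, res in enumerate(residuals):
        for a, b, r in zip(frames[i], frames[i + 1], res):
            total += ((b - a) - r) ** 2
    return total


def residual_grad(frames: List[Frame], residuals: List[Frame]) -> List[Frame]:
    """Analytic gradient of residual_loss with respect to every frame value."""
    grad = [[0.0] * len(f) for f in frames]
    for i, res in enumerate(residuals):
        for k, (a, b, r) in enumerate(zip(frames[i], frames[i + 1], res)):
            err = (b - a) - r
            grad[i + 1][k] += 2.0 * err  # d/d frames[i+1][k]
            grad[i][k] -= 2.0 * err      # d/d frames[i][k]
    return grad


def guidance_step(frames: List[Frame], residuals: List[Frame], step_size: float = 0.1) -> List[Frame]:
    """One gradient-descent update; step_size is a hypothetical choice."""
    grad = residual_grad(frames, residuals)
    return [[v - step_size * g for v, g in zip(f, gf)] for f, gf in zip(frames, grad)]


if __name__ == "__main__":
    frames = [[0.0, 0.0], [0.0, 0.0]]
    predicted = [[1.0, -1.0]]  # stand-in for the ResNet's residual prediction
    updated = guidance_step(frames, predicted)
    print(residual_loss(frames, predicted), residual_loss(updated, predicted))
```

In the framework itself, such an update would be interleaved with the reverse diffusion steps for $t \leq \tau$; the sketch isolates only the gradient update so that the effect on the loss is easy to check.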

Theorems & Definitions (2)

  • Proposition 1
  • proof