UniE2F: A Unified Diffusion Framework for Event-to-Frame Reconstruction with Video Foundation Models

Gang Xu; Zhiyu Zhu; Junhui Hou

TL;DR

This paper establishes a baseline model that uses event data directly as the condition for video synthesis, then introduces event-based inter-frame residual guidance to improve the accuracy of video frame reconstruction, yielding a unified event-to-frame reconstruction framework.

Abstract

Event cameras excel at high-speed, low-power, and high-dynamic-range scene perception. However, because they record only relative intensity changes rather than absolute intensity, the resulting data streams suffer from a significant loss of spatial information and static texture details. In this paper, we address this limitation by leveraging the generative prior of a pre-trained video diffusion model to reconstruct high-fidelity video frames from sparse event data. Specifically, we first establish a baseline model by directly applying event data as a condition to synthesize videos. Then, based on the physical correlation between the event stream and video frames, we introduce event-based inter-frame residual guidance to enhance the accuracy of video frame reconstruction. Furthermore, we extend our method to video frame interpolation and prediction in a zero-shot manner by modulating the reverse diffusion sampling process, thereby creating a unified event-to-frame reconstruction framework. Experimental results on real-world and synthetic datasets demonstrate that our method significantly outperforms previous approaches both quantitatively and qualitatively. Video results are available in the demo included in the supplementary material. The code will be publicly available at https://github.com/CS-GangXu/UniE2F.
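The "physical correlation between the event stream and video frames" mentioned in the abstract can be illustrated with the commonly used event generation model, in which each event marks a log-intensity change of one contrast threshold at a pixel. The following is a minimal sketch under that assumption, not the paper's formulation: the function name `integrate_events`, the `(x, y, polarity)` tuple layout, and the threshold value `0.2` are all hypothetical choices made for illustration.

```python
import numpy as np

def integrate_events(events, shape, contrast=0.2):
    """Accumulate signed event polarities per pixel.

    events:   iterable of (x, y, polarity), polarity in {+1, -1} (assumed layout)
    shape:    (height, width) of the sensor
    contrast: assumed contrast threshold; each event is taken to represent
              a log-intensity change of this magnitude
    Returns an array approximating log I(t1) - log I(t0) per pixel.
    """
    residual = np.zeros(shape, dtype=np.float64)
    for x, y, p in events:
        residual[y, x] += contrast * p
    return residual

# Toy 2x3 sensor: two positive events at pixel (x=0, y=0),
# one negative event at pixel (x=2, y=1).
events = [(0, 0, +1), (0, 0, +1), (2, 1, -1)]
log_residual = integrate_events(events, shape=(2, 3))

# Pixels that emitted no events carry no information at all.
static_pixels = int(np.count_nonzero(log_residual == 0.0))
```

In this toy example four of the six pixels stay at zero, which mirrors the limitation the abstract describes: events constrain how intensity changes between frames, but say nothing about the absolute intensity of static regions.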

Paper Structure (20 sections, 1 theorem, 28 equations, 18 figures, 13 tables, 1 algorithm)

Introduction
Related Work
Preliminary
Proposed Method
Fine-tuning with Event Representation
Inter-Frame Residual Guidance
Adaptation to Video Frame Interpolation and Prediction
Experiment
Experiment Settings
Results of Event-based Frame Reconstruction
Results of Video Frame Interpolation and Prediction
Ablation Study
Conclusion and Discussion
Proofs of Proposition 1
Gradient Alignment with the Tangent Space & Manifold-preserved Sampling
...and 5 more sections

Key Result

Proposition 1

The gradient term $\nabla_{\mathbf{U}^t} \mathcal{L}_{\text{residual}}(\mathbf{U}^t)$ derived from the inter-frame residual guidance lies in the tangent space $T_{\mathbf{U}^t}\mathcal{M}$ of the data manifold $\mathcal{M}$ learned by the diffusion model. Then, we have the following characteristics:

Figures (18)

Figure 1: Illustration of the forward and backward diffusion processes for our UniE2F conditioned on event data. The right and left parts indicate the inputs and results of our algorithm, while in the central plot, the solid and dashed lines with the same color represent the reverse-time sampling SDE and ODE trajectories under the same setting, respectively. The proposed method can adapt to different types of event-assisted frame reconstruction tasks. (A) event-based frame reconstruction: reconstructing RGB frames from events alone; (B) frame prediction: reconstructing the remaining frames from events and the first frame; and (C) frame interpolation: reconstructing the intermediate frames from events and the first and last frames. (D) denotes the ground-truth frames corresponding to the conditional event data.
Figure 2: The schematic of the proposed framework, which integrates event-based inter-frame residual guidance during the inference stage. At step $t$ ($t \leq \tau$), given the event representations, a ResNet predicts the inter-frame residuals between consecutive frames. These residuals are then used to formulate the inter-frame residual loss $\mathcal{L}_\text{residual}$, which is minimized by gradient descent to update the noisy latent.
Figure 3: Visual comparison of event-based video frame reconstruction results on synthetic (1st and 2nd rows) and real-world (3rd and 4th rows) datasets. The event data is visualized as a polarity map.
Figure 4: Visual comparison on synthetic dataset: UniE2F is shown in both video frame interpolation (VFI) and video frame prediction (VFP) modes, while CBMNet is shown under the VFI setting. Note that the frames highlighted with purple borders denote the given frames, whereas frames highlighted with blue borders denote the predicted frames.
Figure 5: Qualitative comparison on HQF (Stoffregen et al., 2020), IJRR (Mueggler et al., 2017), and MVSEC (Zhu et al., 2018).
...and 13 more figures
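
The guidance step described in the Figure 2 caption (forming a loss from predicted inter-frame residuals and updating the latent by gradient descent) can be sketched on toy arrays. This is only an illustrative simplification under stated assumptions: the squared-error form of the loss, the function names `residual_loss`, `residual_grad`, and `guide`, and the step size are hypothetical; the target residuals are random arrays standing in for the ResNet predictions; and the update is applied to plain arrays rather than to the noisy latent inside a diffusion sampler.

```python
import numpy as np

def residual_loss(frames, residuals):
    """Squared error between consecutive-frame differences and target residuals."""
    diff = frames[1:] - frames[:-1]
    return float(np.sum((diff - residuals) ** 2))

def residual_grad(frames, residuals):
    """Analytic gradient of residual_loss with respect to frames."""
    err = 2.0 * ((frames[1:] - frames[:-1]) - residuals)
    grad = np.zeros_like(frames)
    grad[1:] += err    # each later frame enters its difference with sign +1
    grad[:-1] -= err   # each earlier frame enters its difference with sign -1
    return grad

def guide(frames, residuals, step_size=0.1, n_steps=200):
    """Plain gradient descent on the residual loss (assumed hyperparameters)."""
    frames = frames.copy()
    for _ in range(n_steps):
        frames -= step_size * residual_grad(frames, residuals)
    return frames

rng = np.random.default_rng(0)
frames = rng.normal(size=(4, 8, 8))      # stand-in for a 4-frame latent
residuals = rng.normal(size=(3, 8, 8))   # stand-in for predicted residuals

loss_before = residual_loss(frames, residuals)
guided = guide(frames, residuals)
loss_after = residual_loss(guided, residuals)
```

In this toy setting the gradient sums to zero across the frame axis, so the per-pixel temporal mean of the frames is left unchanged: the residual term constrains only frame-to-frame changes, and any absolute-intensity content has to come from elsewhere (in the paper's setting, from the diffusion prior).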

Theorems & Definitions (2)

Proposition 1
proof
