Temporal Residual Guided Diffusion Framework for Event-Driven Video Reconstruction

Lin Zhu; Yunlong Zheng; Yijun Zhang; Xiao Wang; Lizhi Wang; Hua Huang

TL;DR

This paper introduces the Temporal Residual Guided Diffusion Framework, which leverages both temporal and frequency-based event priors to reconstruct high-quality videos from event flow, mitigating the artifacts and over-smoothing commonly observed in previous approaches.

Abstract

Event-based video reconstruction has garnered increasing attention due to its advantages, such as high dynamic range and rapid motion capture capabilities. However, current methods often prioritize the extraction of temporal information from continuous event flow, leading to an overemphasis on low-frequency texture features in the scene, resulting in over-smoothing and blurry artifacts. Addressing this challenge necessitates the integration of conditional information, encompassing temporal features, low-frequency texture, and high-frequency events, to guide the Denoising Diffusion Probabilistic Model (DDPM) in producing accurate and natural outputs. To tackle this issue, we introduce a novel approach, the Temporal Residual Guided Diffusion Framework, which effectively leverages both temporal and frequency-based event priors. Our framework incorporates three key conditioning modules: a pre-trained low-frequency intensity estimation module, a temporal recurrent encoder module, and an attention-based high-frequency prior enhancement module. In order to capture temporal scene variations from the events at the current moment, we employ a temporal-domain residual image as the target for the diffusion model. Through the combination of these three conditioning paths and the temporal residual framework, our framework excels in reconstructing high-quality videos from event flow, mitigating issues such as artifacts and over-smoothing commonly observed in previous approaches. Extensive experiments conducted on multiple benchmark datasets validate the superior performance of our framework compared to prior event-based reconstruction methods.
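To make the idea of a temporal residual as the diffusion target more concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes the residual is the difference between the target intensity frame and the initial low-frequency estimate (consistent with the captions of Figures 4 and 5, but not stated as a formula here), and it uses the standard DDPM forward process with a linear beta schedule. The names `make_alpha_bar`, `temporal_residual`, and `q_sample`, the schedule values, and the toy data are all hypothetical.

```python
import numpy as np

def make_alpha_bar(num_steps=1000, beta_start=1e-4, beta_end=0.02):
    """Cumulative product of (1 - beta_t) for a linear beta schedule (assumed)."""
    betas = np.linspace(beta_start, beta_end, num_steps)
    return np.cumprod(1.0 - betas)

def temporal_residual(target_frame, coarse_estimate):
    """Assumed definition: what the coarse estimate is still missing."""
    return target_frame - coarse_estimate

def q_sample(x0, t, alpha_bar, rng):
    """Standard DDPM forward step: x_t = sqrt(a_bar_t) * x0 + sqrt(1 - a_bar_t) * eps."""
    eps = rng.standard_normal(x0.shape)
    x_t = np.sqrt(alpha_bar[t]) * x0 + np.sqrt(1.0 - alpha_bar[t]) * eps
    return x_t, eps

rng = np.random.default_rng(0)
frame = rng.random((32, 32))                 # stand-in for a ground-truth frame
coarse = np.full_like(frame, frame.mean())   # stand-in for the low-frequency estimate
alpha_bar = make_alpha_bar()

residual = temporal_residual(frame, coarse)  # diffusion target x0
t = 500
x_t, eps = q_sample(residual, t, alpha_bar, rng)

# A noise-prediction network would be trained to recover eps from x_t and the
# conditioning inputs. At sampling time, the generated residual is added back
# to the coarse estimate to obtain the reconstructed frame.
reconstructed = coarse + residual
```

In this sketch the conditioning inputs (current events, the intensity estimate, and recurrent features) are omitted; only the target construction and noising step are shown.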

Paper Structure (12 sections, 13 equations, 9 figures, 2 tables, 2 algorithms)

Introduction
Related Works
Methodology
Problem Statement
Frequency-based Event Priors Analysis
Temporal Residual Diffusion Framework
Triple-path Conditional Model Architecture
Experiments
Experimental Setup
Comparison with the State-of-the-Art Methods
Ablation Study
Conclusion

Figures (9)

Figure 1: Existing methods often emphasize low-frequency texture, causing over-smoothing and loss of high-frequency details in image reconstruction. This motivated us to explore a framework that strategically incorporates both temporal and high-frequency event priors.
Figure 2: Comparison of different strategies. (a) Direct prediction of intensity images from the accumulated features of past events, as in E2VID and ETNet. (b) Joint reconstruction from accumulated event features and prediction from the previous frame, e.g., SPADE-E2VID. (c) Our temporal residual guided diffusion framework. Most methods adopt one of the first two strategies, but the temporal feature extraction inherent in these approaches discards high-frequency information from the events. Our approach resolves this conflict by generating high-frequency temporal residuals through a conditional diffusion model.
Figure 3: Frequency-domain analysis of reconstructed results. (a) Original image; (b) Fourier spectrum of the intensity image; (c) High-frequency components of the intensity image; (d) Local magnification (scaled for representation). Although the events closely resemble the high-frequency map of the scene, E2VID and ETNet cannot reconstruct precise high-frequency details.
Figure 4: Overview of the temporal residual diffusion framework. In Stage I, a pre-trained intensity predictor generates an initial low-frequency estimate. In Stage II, the residual image is computed in the time domain and noise is added. In Stage III, a triple-path conditional model predicts the noise. Please refer to Fig. 5(b) for details of the ResBlock with Cross Attention.
Figure 5: (a) Overview of sampling a video. The conditional diffusion model uses the current events, the intensity estimation, and features from events accumulated up to the previous moment as guidance. It generates high-frequency temporal residuals, which are added to the initial intensity estimation to produce the intensity image for each frame. (b) Overview of the ResBlock with Cross Attention, which attends from the noisy temporal residuals to the event accumulation and intensity estimation features; GN denotes group normalization.
...and 4 more figures
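The frequency-domain view in Figure 3 can be illustrated with a small NumPy sketch of an ideal high-pass filter. This is only an illustration of how high-frequency components of an intensity image might be isolated; the paper's actual analysis procedure is not specified here, and the function name `high_frequency_component`, the cutoff radius, and the toy image are assumptions.

```python
import numpy as np

def high_frequency_component(image, cutoff_radius):
    """Zero out Fourier coefficients within cutoff_radius of the DC term
    and return the real part of the inverse transform."""
    h, w = image.shape
    # Shift the spectrum so that the DC term sits at (h // 2, w // 2).
    spectrum = np.fft.fftshift(np.fft.fft2(image))
    yy, xx = np.ogrid[:h, :w]
    dist = np.sqrt((yy - h // 2) ** 2 + (xx - w // 2) ** 2)
    spectrum[dist <= cutoff_radius] = 0.0
    return np.real(np.fft.ifft2(np.fft.ifftshift(spectrum)))

# Toy image: a vertical step edge, which carries most of its detail
# in the high-frequency band.
image = np.zeros((32, 32))
image[:, 16:] = 1.0

high = high_frequency_component(image, cutoff_radius=4)
low = image - high   # complementary low-frequency part

flat = high_frequency_component(np.full((32, 32), 0.5), cutoff_radius=4)
```

Here `low` plays the role of an over-smoothed reconstruction and `high` the detail that such a reconstruction loses; a constant image such as the one used for `flat` has no high-frequency content at all.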
