E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning

Qiang Qu, Yiran Shen, Xiaoming Chen, Yuk Ying Chung, Tongliang Liu

TL;DR

E2HQV tackles the challenge of generating high-quality video frames from asynchronous event streams by combining a theory-inspired E2V model with a model-aided learning framework. The approach introduces three modules: REFE, TSEM, and MAFG. The model-aided frame generator (MAFG) estimates the key parameters $\mathcal{E}_{(+/-)}$ and $\theta_{(+/-)}$ from events and recursively generates frames via the relation $f_{t_1}^{x,y} = \exp\left(\theta_{+}^{x,y} \mathcal{E}_{+}^{x,y} - \theta_{-}^{x,y} \mathcal{E}_{-}^{x,y}\right)\left(f_{t_0}^{x,y} + k\right) - k$, where $k$ is a constant and the thresholds $\theta_{(+/-)}$ are treated as constant within short periods. The Temporal Shift Embedding module (TSEM) mitigates the perturbations caused by state resets in the recurrent components, enabling robust fusion of event features and reconstructed frames. Experiments on simulated and real-world datasets (IJRR, MVSEC, HQF) show that E2HQV outperforms seven state-of-the-art (SOTA) methods by large margins on MSE and SSIM, with notable improvements in texture and contrast and a favorable balance between computation and accuracy. The framework advances practical E2V by incorporating principled priors, improving texture reconstruction in complex scenes, and providing scalable, interpretable video generation from event data.
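The recursive relation in the TL;DR can be illustrated with a minimal NumPy sketch. It is not the authors' implementation: the function names, the toy parameter values, and the choice `k = 1.0` are assumptions made for illustration, and in E2HQV the maps $\mathcal{E}_{(+/-)}$ and $\theta_{(+/-)}$ would be estimated by MAFG rather than set by hand. Taking logarithms shows that the relation is equivalent to $\log(f_{t_1}^{x,y} + k) - \log(f_{t_0}^{x,y} + k) = \theta_{+}^{x,y}\mathcal{E}_{+}^{x,y} - \theta_{-}^{x,y}\mathcal{E}_{-}^{x,y}$, which the sketch checks numerically.

```python
import numpy as np


def next_frame(f_t0, e_pos, e_neg, theta_pos, theta_neg, k=1.0):
    """Per-pixel update: f_t1 = exp(theta+ * E+ - theta- * E-) * (f_t0 + k) - k.

    All array arguments share the frame's (H, W) shape.
    k=1.0 is an assumed value; the source only says k is a constant.
    """
    log_change = theta_pos * e_pos - theta_neg * e_neg
    return np.exp(log_change) * (f_t0 + k) - k


def generate_frames(f_init, param_sequence, k=1.0):
    """Apply the update recursively; each frame seeds the next one."""
    frames = []
    frame = f_init
    for e_pos, e_neg, theta_pos, theta_neg in param_sequence:
        frame = next_frame(frame, e_pos, e_neg, theta_pos, theta_neg, k)
        frames.append(frame)
    return frames


# Toy 2x2 example with hand-picked (hypothetical) parameter maps.
f0 = np.full((2, 2), 0.5)
e_pos = np.array([[2.0, 0.0], [0.0, 1.0]])
e_neg = np.array([[0.0, 2.0], [0.0, 1.0]])
theta = np.full((2, 2), 0.1)  # same threshold for both polarities here

f1 = next_frame(f0, e_pos, e_neg, theta, theta)
frames = generate_frames(f0, [(e_pos, e_neg, theta, theta)] * 3)
```

In this toy case the pixel with only positive terms brightens, the pixel with only negative terms darkens, and pixels where the two terms cancel keep their previous value.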

Abstract

Bio-inspired event cameras, or dynamic vision sensors, asynchronously capture per-pixel brightness changes (called event-streams) with high temporal resolution and high dynamic range. However, the unstructured spatio-temporal nature of event-streams makes it challenging to provide intuitive visualization with rich semantic information for human vision. This calls for events-to-video (E2V) solutions, which take event-streams as input and generate high-quality video frames for intuitive visualization. Current solutions, however, are predominantly data-driven and do not consider prior knowledge of the underlying statistics relating event-streams and video frames. They rely heavily on the non-linearity and generalization capability of deep neural networks and therefore struggle to reconstruct detailed textures when scenes are complex. In this work, we propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events. The approach leverages a model-aided deep learning framework underpinned by a theory-inspired E2V model, which is derived from the fundamental imaging principles of event cameras. To deal with the issue of state-reset in the recurrent components of E2HQV, we also design a temporal shift embedding module to further improve the quality of the video frames. Comprehensive evaluations on real-world event camera datasets validate our approach, with E2HQV notably outperforming state-of-the-art approaches, e.g., surpassing the second best by over 40% on some evaluation metrics.

Paper Structure

This paper contains 18 sections, 12 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Conceptual Overview of the Proposed Model-Aided Learning Framework.
  • Figure 2: Overview of the Proposed Model-Aided Learning Framework.
  • Figure 3: Detailed settings of the model-aided frame generator (MAFG). The generator takes the multimodal features $\Psi$, integrated from the outputs of the REFE and TSEM modules, and feeds them into a shared downsampling encoder, $h^{enc}$. The encoder is followed by two upsampling decoders, $h^{dec}_{\mathcal{E}_{(+/-)}}$ and $h^{dec}_{\theta_{(+/-)}}$, and four output branches, which estimate the key parameters $\mathcal{E}_{(+/-)}$ and $\theta_{(+/-)}$, respectively. Given the estimated parameters, video frames are recursively generated from events according to the equation labeled eq:f_n, which is derived from the theoretical relation between frames and event-streams.
  • Figure 4: Structure of the Temporal Shift Embedding.
  • Figure 5: Qualitative Analysis across Datasets. Comparative visualizations of sequence data from HQF (rows 1-3), IJRR (rows 4-6), and MVSEC (rows 7-9). The evaluated baseline methods often exhibit limitations such as diminished contrast, noticeable blur, and prominent artifacts. In contrast, our reconstructions offer high contrast and are adept at maintaining sharp edge details, while manifesting minimal artifacts in regions devoid of texture.
  • ...and 1 more figure
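The data flow described in the Figure 3 caption (shared encoder, two decoders, four output branches, then the frame update) can be sketched at the level of array shapes. This is only a rough NumPy illustration under strong assumptions, not the paper's network: average pooling stands in for $h^{enc}$, channel mixing plus nearest-neighbour upsampling stands in for each decoder, every branch is a 1x1 projection with a softplus to keep its output positive, and all weights are random. The names `encode`, `decode`, `branch`, and `mafg_step`, the tensor sizes, and `k = 1.0` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)


def encode(psi):
    """Stand-in for the shared encoder: 2x2 average pooling on (C, H, W)."""
    c, h, w = psi.shape
    return psi.reshape(c, h // 2, 2, w // 2, 2).mean(axis=(2, 4))


def decode(latent, mix):
    """Stand-in for one decoder: channel mixing, then 2x nearest upsampling."""
    mixed = np.tensordot(mix, latent, axes=([1], [0]))  # (C, H/2, W/2)
    return mixed.repeat(2, axis=1).repeat(2, axis=2)    # (C, H, W)


def branch(features, weights):
    """Stand-in output branch: 1x1 projection to one map, then softplus."""
    z = np.tensordot(weights, features, axes=([0], [0]))  # (H, W)
    return np.logaddexp(0.0, z)  # softplus; positivity is an assumption


def mafg_step(psi, f_prev, params, k=1.0):
    """One generation step: estimate four parameter maps, then update the frame."""
    latent = encode(psi)                              # shared encoder
    feat_e = decode(latent, params["dec_e"])          # decoder for E(+/-)
    feat_theta = decode(latent, params["dec_theta"])  # decoder for theta(+/-)
    e_pos = branch(feat_e, params["e_pos"])
    e_neg = branch(feat_e, params["e_neg"])
    theta_pos = branch(feat_theta, params["theta_pos"])
    theta_neg = branch(feat_theta, params["theta_neg"])
    f_next = np.exp(theta_pos * e_pos - theta_neg * e_neg) * (f_prev + k) - k
    return f_next, (e_pos, e_neg, theta_pos, theta_neg)


C, H, W = 4, 8, 8  # hypothetical feature and frame sizes (H, W must be even)
params = {
    "dec_e": rng.normal(scale=0.1, size=(C, C)),
    "dec_theta": rng.normal(scale=0.1, size=(C, C)),
    "e_pos": rng.normal(scale=0.1, size=C),
    "e_neg": rng.normal(scale=0.1, size=C),
    "theta_pos": rng.normal(scale=0.1, size=C),
    "theta_neg": rng.normal(scale=0.1, size=C),
}
psi = rng.normal(size=(C, H, W))  # stand-in for the fused features
f_prev = np.full((H, W), 0.5)     # previous frame
f_next, maps = mafg_step(psi, f_prev, params)
```

The point of the sketch is the branching structure: both decoders read the same latent code, each decoder feeds two branches, and only the four resulting maps enter the closed-form frame update.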