E2HQV: High-Quality Video Generation from Event Camera via Theory-Inspired Model-Aided Deep Learning
Qiang Qu, Yiran Shen, Xiaoming Chen, Yuk Ying Chung, Tongliang Liu
TL;DR
E2HQV tackles the challenge of generating high-quality video frames from asynchronous event streams by combining a theory-inspired E2V model with a model-aided learning framework. The approach introduces three components: REFE, TSEM, and MAFG. MAFG estimates the key parameters ${\mathcal{E}}_{(+/-)}$ and ${\theta}_{(+/-)}$ from events and recursively generates frames via the relation $f_{t_1}^{x,y} = \exp\left( \theta^{x,y}_{+} {\mathcal{E}}^{x,y}_{+} - \theta^{x,y}_{-} {\mathcal{E}}^{x,y}_{-} \right)(f_{t_0}^{x,y} + k) - k$, where $k$ is a constant and the thresholds ${\theta}_{(+/-)}$ are treated as constant within short periods. The Temporal Shift Embedding Module (TSEM) mitigates state-reset perturbations, enabling robust fusion of event features and reconstructed frames. Experiments on simulated and real-world datasets (IJRR, MVSEC, HQF) show that E2HQV outperforms seven SOTA methods by large margins on MSE and SSIM, with notable improvements in texture and contrast and a favorable balance between computation and accuracy. The framework advances practical E2V by incorporating principled priors, improving texture reconstruction in complex scenes, and providing scalable, interpretable video generation from event data.
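As an illustration of the recursive relation above, the following minimal NumPy sketch applies one update step per pixel. It is not the authors' implementation: in E2HQV the parameters ${\mathcal{E}}_{(+/-)}$ and ${\theta}_{(+/-)}$ are estimated by MAFG from events, whereas here they are simply given arrays with made-up values. The function name `next_frame`, the value `k=1.0`, and the 2x2 toy inputs are all assumptions chosen for illustration.

```python
import numpy as np


def next_frame(f_t0, theta_pos, e_pos, theta_neg, e_neg, k=1.0):
    """One step of f_t1 = exp(theta+ * E+ - theta- * E-) * (f_t0 + k) - k.

    All array arguments share the frame's (H, W) shape and are combined
    element-wise, i.e. per pixel (x, y). k=1.0 is an assumed value.
    """
    # Per-pixel multiplicative gain implied by the positive/negative terms.
    gain = np.exp(theta_pos * e_pos - theta_neg * e_neg)
    return gain * (f_t0 + k) - k


# Hypothetical toy inputs: a flat 2x2 frame and hand-picked parameter maps.
f_t0 = np.full((2, 2), 0.5)
e_pos = np.array([[1.0, 0.0], [0.0, 2.0]])   # positive-polarity term
e_neg = np.array([[0.0, 1.0], [0.0, 0.0]])   # negative-polarity term
theta_pos = np.full((2, 2), 0.2)             # assumed positive threshold
theta_neg = np.full((2, 2), 0.2)             # assumed negative threshold

f_t1 = next_frame(f_t0, theta_pos, e_pos, theta_neg, e_neg, k=1.0)
print(f_t1)
```

In this sketch, a pixel whose positive and negative terms are both zero keeps its previous value, a pixel with only a positive term gets brighter, and a pixel with only a negative term gets darker. Equivalently, the change in $\log(f + k)$ equals $\theta_{+}{\mathcal{E}}_{+} - \theta_{-}{\mathcal{E}}_{-}$ at each pixel.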
Abstract
Bio-inspired event cameras, or dynamic vision sensors, are capable of asynchronously capturing per-pixel brightness changes (called event-streams) at high temporal resolution and high dynamic range. However, the non-structural spatial-temporal event-streams make it challenging to provide intuitive visualization with rich semantic information for human vision. This calls for events-to-video (E2V) solutions, which take event-streams as input and generate high-quality video frames for intuitive visualization. However, current solutions are predominantly data-driven and do not consider prior knowledge of the underlying statistics relating event-streams and video frames. They rely heavily on the non-linearity and generalization capability of deep neural networks and thus struggle to reconstruct detailed textures when the scenes are complex. In this work, we propose E2HQV, a novel E2V paradigm designed to produce high-quality video frames from events. This approach leverages a model-aided deep learning framework, underpinned by a theory-inspired E2V model that is derived from the fundamental imaging principles of event cameras. To deal with the issue of state-reset in the recurrent components of E2HQV, we also design a temporal shift embedding module to further improve the quality of the video frames. Comprehensive evaluations on real-world event camera datasets validate our approach, with E2HQV notably outperforming state-of-the-art approaches, e.g., surpassing the second best by over 40% on some evaluation metrics.
