HyperE2VID: Improving Event-Based Video Reconstruction via Hypernetworks

Burak Ercan, Onur Eker, Canberk Saglam, Aykut Erdem, Erkut Erdem

TL;DR

HyperE2VID addresses the challenge of reconstructing high-quality intensity videos from sparse event streams by using hypernetworks to generate per-pixel adaptive filters. A context fusion module guides the dynamic filters with information from event voxel grids and previously reconstructed frames, and a curriculum learning strategy stabilizes training. Empirical results show HyperE2VID outperforms state-of-the-art E2VID+-based methods with fewer parameters and faster inference, across multiple datasets and scenarios including high frame rates and motionless periods. An extensive ablation study clarifies the roles of context information, dynamic convolutions, and hypernetworks, while a simple post-processing step can further reduce textureless-region artifacts. This combination yields robust, efficient event-based video reconstruction with practical applicability to real-world fast-motion and low-light conditions.

Abstract

Event-based cameras are becoming increasingly popular for their ability to capture high-speed motion with low latency and high dynamic range. However, generating videos from events remains challenging due to the highly sparse and varying nature of event data. To address this, in this study, we propose HyperE2VID, a dynamic neural network architecture for event-based video reconstruction. Our approach uses hypernetworks to generate per-pixel adaptive filters guided by a context fusion module that combines information from event voxel grids and previously reconstructed intensity images. We also employ a curriculum learning strategy to train the network more robustly. Our comprehensive experimental evaluations across various benchmark datasets reveal that HyperE2VID not only surpasses current state-of-the-art methods in terms of reconstruction quality but also achieves this with fewer parameters, reduced computational requirements, and accelerated inference times.
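
To make the idea of per-pixel adaptive filters concrete, the following is a minimal sketch, not the authors' architecture. It assumes a toy "hypernetwork" that is just a linear map from a two-value context vector (for example, an event-derived value and a previous-reconstruction value) to the nine taps of a 3x3 kernel. The names `hyper_kernel` and `per_pixel_filter`, the context layout, and the zero padding are all illustrative assumptions. Unlike an ordinary convolution, where one kernel is shared across the image, each pixel here is filtered with its own kernel, generated from the context at that location.

```python
def hyper_kernel(context, weights, bias):
    """Toy stand-in for a hypernetwork (assumption): a linear map from a
    per-pixel context vector to k*k filter taps, one row of weights per tap."""
    return [sum(w * c for w, c in zip(row, context)) + b
            for row, b in zip(weights, bias)]


def per_pixel_filter(feature, context_map, weights, bias, k=3):
    """Filter a 2D feature map with a different k x k kernel at every pixel.

    feature:     H x W nested list of numbers
    context_map: H x W nested list of context vectors (hypothetical layout)
    weights:     (k*k) x C matrix and bias: k*k vector for hyper_kernel
    """
    h, w = len(feature), len(feature[0])
    r = k // 2
    out = [[0.0] * w for _ in range(h)]
    for y in range(h):
        for x in range(w):
            # The kernel is generated from the context at this pixel only.
            taps = hyper_kernel(context_map[y][x], weights, bias)
            acc = 0.0
            for dy in range(-r, r + 1):
                for dx in range(-r, r + 1):
                    yy, xx = y + dy, x + dx
                    if 0 <= yy < h and 0 <= xx < w:  # zero padding at borders
                        acc += taps[(dy + r) * k + (dx + r)] * feature[yy][xx]
            out[y][x] = acc
    return out


if __name__ == "__main__":
    feature = [[1, 2], [3, 4]]
    # Context channel 0 varies per pixel; channel 1 is unused in this demo.
    context = [[[1, 0], [2, 0]], [[3, 0], [4, 0]]]
    weights = [[0, 0] for _ in range(9)]
    weights[4] = [1, 0]  # centre tap follows context channel 0
    print(per_pixel_filter(feature, context, weights, [0] * 9))
```

In this demo only the centre tap depends on the context, so each pixel is scaled by its own context value. A learned hypernetwork would instead produce full spatial kernels, and in HyperE2VID the guiding context comes from the context fusion module described above.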

Paper Structure

This paper contains 4 sections, 5 figures, and 2 tables.

Figures (5)

  • Figure 1: Understanding the role of context information. This figure shows frames, events, and reconstructions from two distinct scenes: one with fast motion (top) and another with slow motion (bottom). It highlights the significance of utilizing event and reconstruction data as context information for optimal results.
  • Figure 2: Effect of using event voxel grids with different temporal windows and event numbers. We consider the four best-performing methods (E2VID+, FireNet+, ET-Net, and HyperE2VID) and compute their mean LPIPS scores on the ECD, MVSEC, and HQF datasets, using a variety of event grouping settings. (a) We conduct ten sets of experiments, each using a different temporal window ranging from 10 ms to 100 ms. (b) We conduct ten experiment runs, each using fixed-number event grouping with a different event count ranging from 2K to 45K. For (a) and (b), we employ a tolerance of 1 ms to match the reconstructions with ground truth frames, and calculate LPIPS scores whenever there is a match. Then, we plot the mean LPIPS scores across these experiment runs for each method. The results demonstrate the superiority of the proposed HyperE2VID architecture for generating high-quality reconstructions over a wide range of event grouping settings.
  • Figure 3: High frame rate video synthesis. We employ a simple approach with fixed-temporal-window event grouping for generating videos with high FPS. Here we present frames corresponding to the first second of the slider_depth sequence from the ECD dataset, taken from videos reconstructed at 200Hz, 500Hz, 1kHz, 2kHz, and 5kHz, which are generated by using temporal windows of 5ms, 2ms, 1ms, 500µs, and 200µs, respectively. While most of the other methods start to generate videos with lower visual quality as we increase FPS above one thousand, HyperE2VID maintains its high contrast and sharp reconstructions even when generating videos with several thousand frames per second.
  • Figure 4: Assessing reconstruction quality in motionless sections. Stationary sections in event sequences pose additional challenges for video reconstruction because the event rate drops to almost zero, with only noise events being generated. Here, we consider a segment from the UZH-FPV Drone Racing dataset, where the drone lands on a board with ArUco markers and stops. For each method, we present the reconstruction from just after the drone stops in the leftmost column, and three more reconstructions at one-second intervals in the subsequent columns. Ideally, a method should retain its most recent reconstruction during the pause, but most methods begin to produce degraded intensity images within a few seconds: the images gradually decay and show artifacts such as blurry and bleeding edges. HyperE2VID, on the other hand, preserves its high-contrast, sharp reconstructions during the motionless segment, thanks to a network architecture that allows it to adapt dynamically to highly varying event data.
  • Figure 5: Visual results of post-processing. Here, we consider two scenes from the ECD and HQF datasets and present reconstructions of E2VID+, ET-Net, and HyperE2VID for each scene, with or without post-processing. The results demonstrate that the post-processing can satisfactorily remove or minimize most of the fine-scale artifacts, such as checkerboard patterns.
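
The event grouping settings mentioned in the captions of Figures 2 and 3 can be sketched as follows. This is an illustrative outline under stated assumptions, not the evaluation code used in the paper: timestamps are assumed to be sorted integers in microseconds, the function names are hypothetical, and the nearest-frame matching rule is one plausible reading of "a tolerance of 1 ms". The sketch shows fixed-temporal-window grouping, fixed-number grouping, the window length implied by a target frame rate, and tolerance-based matching of reconstructions to ground truth frames.

```python
def window_us_for_rate(rate_hz):
    """Window length in microseconds when one frame is reconstructed per window."""
    return 1_000_000 // rate_hz


def group_by_time(timestamps_us, window_us):
    """Fixed-temporal-window grouping of sorted event timestamps.

    Windows with no events are kept as empty lists, which is what happens
    during motionless periods.
    """
    if not timestamps_us:
        return []
    t0 = timestamps_us[0]
    n_windows = (timestamps_us[-1] - t0) // window_us + 1
    groups = [[] for _ in range(n_windows)]
    for t in timestamps_us:
        groups[(t - t0) // window_us].append(t)
    return groups


def group_by_count(timestamps_us, count):
    """Fixed-number grouping; a trailing incomplete group is dropped (assumption)."""
    last_start = len(timestamps_us) - count
    return [timestamps_us[i:i + count] for i in range(0, last_start + 1, count)]


def match_frames(recon_times_us, frame_times_us, tol_us=1000):
    """Pair each reconstruction with its nearest ground truth frame,
    keeping the pair only if the time difference is within tol_us."""
    if not frame_times_us:
        return []
    pairs = []
    for i, t_rec in enumerate(recon_times_us):
        j = min(range(len(frame_times_us)),
                key=lambda idx: abs(frame_times_us[idx] - t_rec))
        if abs(frame_times_us[j] - t_rec) <= tol_us:
            pairs.append((i, j))
    return pairs


if __name__ == "__main__":
    for rate in (200, 500, 1000, 2000, 5000):
        print(rate, "Hz ->", window_us_for_rate(rate), "us")
    events = [0, 10, 4999, 5000, 12000]
    print(group_by_time(events, 5000))
    print(group_by_count(events, 2))
    print(match_frames([1000, 5000, 9000], [1500, 9900]))
```

The rate-to-window relation reproduces the pairs listed for Figure 3 (for example, 200 Hz corresponds to a 5 ms window and 5 kHz to 200 µs). With fixed-window grouping the number of events per group varies with scene motion, while fixed-number grouping keeps the event count constant and lets the time span vary.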