Event-Enhanced Snapshot Compressive Videography at 10K FPS

Bo Zhang; Jinli Suo; Qionghai Dai

TL;DR

The paper addresses ultrafast videography under a limited data bandwidth by combining intensity-based snapshot compressive imaging (SCI) with event-camera data. It introduces a dual-path optical setup and a dual-branch Transformer that jointly use coded intensity measurements and asynchronous events to reconstruct dense high-speed frames at 10K FPS, demonstrated on both simulated and real data. The key contributions are a compact, photon-efficient dual-path optical design and a network architecture that fuses intensity and event information for both dense-frame reconstruction and timestamp-aware interpolation, outperforming state-of-the-art video SCI and video frame interpolation (VFI) methods. The approach offers a practical route to high-throughput videography with low-cost sensors, although it is currently limited by non-real-time processing and by its reliance on specialized hardware.

Abstract

Video snapshot compressive imaging (SCI) encodes the target dynamic scene compactly into a snapshot and reconstructs its high-speed frame sequence afterward, greatly reducing the required data footprint and transmission bandwidth as well as enabling high-speed imaging with a low-frame-rate intensity camera. In implementation, high-speed dynamics are encoded via temporally varying patterns, and only frames at the corresponding temporal intervals can be reconstructed, while the dynamics occurring between consecutive frames are lost. To unlock the potential of conventional snapshot compressive videography, we propose a novel hybrid "intensity+event" imaging scheme by incorporating an event camera into a video SCI setup. Our proposed system consists of a dual-path optical setup that records the coded intensity measurement and the intermediate event signals simultaneously; it is compact and photon-efficient because it collects the half of the photons that conventional video SCI discards. Correspondingly, we developed a dual-branch Transformer that exploits the reciprocal relationship between the two data modes to decode dense video frames. Extensive experiments on both simulated and real-captured data demonstrate the superiority of our approach over state-of-the-art video SCI and video frame interpolation (VFI) methods. Benefiting from the new hybrid design, which leverages both the intrinsic redundancy in videos and the unique features of event cameras, we achieve high-quality videography at 0.1 ms time intervals with a low-cost CMOS image sensor working at 24 FPS.
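To make the two data modes concrete, the sketch below simulates both of them on a toy scene. It is not the authors' formulation, whose equations are not reproduced here. It assumes the standard video SCI forward model (the snapshot is the sum of mask-modulated frames) and an idealized event camera that fires whenever the log intensity at a pixel changes by more than a fixed threshold. The function names `sci_measurement` and `simulate_events`, the threshold value, and the toy scene are all hypothetical.

```python
import numpy as np


def sci_measurement(frames, masks):
    """Coded snapshot under the standard video SCI model: sum_t masks[t] * frames[t]."""
    return (frames * masks).sum(axis=0)


def simulate_events(frames, threshold=0.2, eps=1e-3):
    """Idealized event model (an assumption, not a sensor-accurate simulator).

    A pixel emits an event when its log intensity differs from the value stored
    at its last event by at least `threshold`. Returns (t, y, x, polarity) tuples.
    """
    log_ref = np.log(frames[0] + eps)
    events = []
    for t in range(1, frames.shape[0]):
        log_cur = np.log(frames[t] + eps)
        diff = log_cur - log_ref
        fired = np.abs(diff) >= threshold
        ys, xs = np.nonzero(fired)
        polarity = np.sign(diff[ys, xs]).astype(int)
        events.extend(zip([t] * len(ys), ys.tolist(), xs.tolist(), polarity.tolist()))
        # Reset the reference only where an event fired.
        log_ref[fired] = log_cur[fired]
    return events


# Toy scene: a bright square moving right by one pixel per frame.
rng = np.random.default_rng(0)
T, H, W = 8, 16, 16
frames = np.full((T, H, W), 0.1)
for t in range(T):
    frames[t, 4:8, t:t + 4] = 1.0

# Random binary masks stand in for the temporally varying coding patterns.
masks = rng.integers(0, 2, size=(T, H, W)).astype(float)

snapshot = sci_measurement(frames, masks)   # one 2D coded measurement
events = simulate_events(frames)            # sparse record of in-between motion
print(snapshot.shape, len(events))
```

The snapshot collapses all `T` frames into a single image, whereas the event list keeps a timestamp for every brightness change, which is the complementary information the hybrid scheme relies on.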

Paper Structure (24 sections, 4 equations, 6 figures, 3 tables)

Introduction
Related work
Video snapshot compressive imaging
Mathematical formulation
Optical designs
Reconstruction algorithms
Event cameras
Video frame interpolation
Method
Formulation of the dual-path intensity+event SCI scheme
Dual-path intensity+event SCI setup
Dual-branch Transformer reconstructing highly dynamic scenes
Experiments
Simulation experiment
Datasets
...and 9 more sections

Figures (6)

Figure 1: A representative example of the inputs and outputs of the proposed imaging scheme. From a coded snapshot and the series of events occurring within the exposure period, we can reconstruct around 2500 frames (corresponding to 10K frames per second) that record the dynamics of the target scene with high fidelity.
Figure 2: The proposed dual-arm imaging setup. (a) and (b) display the schematic optical path and established prototype respectively. The incident light of a scene point converges at the image plane, which is transferred to the 2nd image plane by Relay lens $\#1$ and the 3rd plane by Relay lens $\#2$. Before arriving at the 2nd imaging plane, the light beam is split by a PBS into two halves, one detected with a DVS and the other modulated by LCoS and then captured with a CMOS.
Figure 3: The network structure of the proposed dual-branch Transformer for dense video reconstruction. The dual-mode data representation module initializes the intensity frames and generates both intensity and event tokens. The main branch processes the two types of tokens and reconstructs frames at the encoding frame rate, while the 2nd branch additionally takes in timestamp-aware event tokens and produces a much denser sequence with high fidelity.
Figure 4: Reconstructed consecutive frames of two challenging scenes from the BS-ERGB test set. VFIformer produces blurry results for such fast motions even with only skip=1 and fails with skip=15. Time Lens predicts better motion with auxiliary events but suffers color distortion with a larger skip. STFormer+VFI (lower left: VFIformer, upper right: Time Lens) reconstructs frames with decent quality at the frames corresponding to the encoding device but interpolates blurry results at the intermediate timestamps. In comparison, our method reconstructs high-quality frames with consistently accurate dynamics across dense timestamps.
Figure 5: Visual results of exemplary frames in color and gray benchmark datasets, with zoomed-in views provided for a clearer performance comparison. For the results by STFormer+VFI, the lower left corner is by VFIformer, and the upper right is by Time Lens.
...and 1 more figure
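The figures quoted in the abstract and in the Figure 1 caption (10K FPS, 0.1 ms intervals, a 24 FPS sensor, around 2500 frames) can be cross-checked with simple arithmetic. The sketch below does only that. The variable names are hypothetical, and how the 2500-frame sequence maps onto individual exposures is not specified here, so the last quantity is a duration comparison rather than a claim about the capture protocol.

```python
from fractions import Fraction

target_fps = 10_000      # reconstruction rate stated in the title
sensor_fps = 24          # CMOS frame rate stated in the abstract
frames_in_example = 2500 # approximate frame count from the Figure 1 caption

# Time between reconstructed frames, in milliseconds.
interval_ms = Fraction(1000, target_fps)

# Reconstructed timestamps that fit into one sensor frame period.
timestamps_per_frame_period = Fraction(target_fps, sensor_fps)

# Scene time covered by the Figure 1 example, in seconds.
example_duration_s = Fraction(frames_in_example, target_fps)

# The same duration expressed in 24 FPS frame periods.
frame_periods_spanned = example_duration_s * sensor_fps

print(float(interval_ms), float(timestamps_per_frame_period))
print(float(example_duration_s), float(frame_periods_spanned))
```

The interval works out to 0.1 ms, matching the abstract, and each 24 FPS frame period holds roughly 417 reconstructed timestamps.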
