Deep Optics for Video Snapshot Compressive Imaging
Ping Wang, Lishun Wang, Xin Yuan
TL;DR
The paper narrows the gap between theory and practice in video snapshot compressive imaging by jointly designing optics and computation. It introduces a structural mask that enables full-dynamic-range (FDR) and motion-aware measurements, and a Transformer-based decoder (Res2former) tailored to long-term temporal dependencies. By incorporating a sensor-response-aware forward model into end-to-end training, the authors reduce the mismatch between synthetic simulation and real hardware, which they demonstrate on a DMD-based prototype. Key contributions include the proposed mask $\boldsymbol{M}_{\lambda}$, which satisfies $\sum_{t=1}^B \boldsymbol{M}_{\lambda}(:,:,t) = \mathbf{1}$ to achieve FDR, a differentiable structural-mask optimization framework, and the Res2former decoder, whose temporal self-attention achieves competitive reconstruction quality at lower computational cost. Results on both synthetic and real data show improved reconstruction quality and robust dynamic-range recovery, moving video SCI closer to practical deployment.
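To make the constraint $\sum_{t=1}^B \boldsymbol{M}_{\lambda}(:,:,t) = \mathbf{1}$ concrete, the following is a minimal NumPy sketch, not the authors' implementation. It assumes a binary mask in which each pixel is open in exactly one of the $B$ frames, with that frame chosen at random (the paper instead learns the mask), and it compares the resulting measurement range against an ordinary random binary mask. The function names `one_hot_temporal_mask` and `sci_measurement` are hypothetical.

```python
import numpy as np

def one_hot_temporal_mask(height, width, num_frames, rng):
    """Binary mask of shape (H, W, B) with exactly one open frame per pixel.

    Assumption: the open frame is drawn uniformly at random; this only
    illustrates the sum-to-one constraint, not a learned structural mask.
    """
    open_frame = rng.integers(0, num_frames, size=(height, width))
    # Compare each pixel's chosen index against 0..B-1 to build a one-hot stack.
    return (open_frame[..., None] == np.arange(num_frames)).astype(np.float64)

def sci_measurement(video, mask):
    """Ideal snapshot measurement: sum over time of mask * video."""
    return (mask * video).sum(axis=-1)

rng = np.random.default_rng(0)
H, W, B = 32, 32, 8
video = rng.random((H, W, B))  # frames with intensities in [0, 1)

structural = one_hot_temporal_mask(H, W, B, rng)
random_binary = (rng.random((H, W, B)) < 0.5).astype(np.float64)

y_structural = sci_measurement(video, structural)
y_random = sci_measurement(video, random_binary)

print("mask sums to one:", np.allclose(structural.sum(axis=-1), 1.0))
print("max measurement, sum-to-one mask:", y_structural.max())
print("max measurement, random binary mask:", y_random.max())
```

Under these assumptions, the sum-to-one mask keeps every measured pixel within the intensity range of a single frame, whereas the random binary mask accumulates several frames per pixel and can exceed that range.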
Abstract
Video snapshot compressive imaging (SCI) aims to capture a sequence of video frames with a single shot of a 2D detector. Its two pillars are optical modulation patterns (also known as masks) and a computational reconstruction algorithm. Advanced deep learning algorithms and mature hardware are bringing video SCI into practical applications. Yet there are two clouds in the sunshine of SCI: i) low dynamic range, a consequence of high temporal multiplexing, and ii) the degradation of existing deep learning algorithms on real systems. To address these challenges, this paper presents a deep optics framework that jointly optimizes the masks and the reconstruction network. Specifically, we first propose a new type of structural mask to realize motion-aware and full-dynamic-range measurement. Exploiting this motion-awareness property in the measurement domain, we develop an efficient Transformer-based network for video SCI reconstruction, dubbed Res2former, that captures long-term temporal dependencies. Moreover, the sensor response is introduced into the forward model of video SCI so that end-to-end training closely matches the real system. Finally, we implement the learned structural masks on a digital micro-mirror device. Experimental results on synthetic and real data validate the effectiveness of the proposed framework. We believe this is a milestone for real-world video SCI. The source code and data are available at https://github.com/pwangcs/DeepOpticsSCI.
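The idea of placing a sensor response inside the forward model can be sketched as follows. This is an illustration under stated assumptions, not the paper's model: the response here is a simple saturation at a full-well level followed by uniform quantization, and the names `sensor_response`, `forward_model`, `full_well`, and `bit_depth` are hypothetical.

```python
import numpy as np

def sensor_response(measurement, full_well=1.0, bit_depth=8):
    """Hypothetical sensor model: saturate at full_well, then quantize uniformly."""
    levels = 2 ** bit_depth - 1
    clipped = np.clip(measurement, 0.0, full_well)
    return np.round(clipped / full_well * levels) / levels * full_well

def forward_model(video, mask, **sensor_kwargs):
    """Mask-modulated temporal sum followed by the assumed sensor response."""
    ideal = (mask * video).sum(axis=-1)  # ideal snapshot measurement
    return sensor_response(ideal, **sensor_kwargs)

H, W, B = 4, 4, 8
video = np.full((H, W, B), 0.2)

# All frames open at every pixel: the ideal sum (1.6) saturates the sensor.
all_open = np.ones((H, W, B))
y_saturated = forward_model(video, all_open)

# Only the first frame open: the mask sums to one over time, no saturation.
first_only = np.zeros((H, W, B))
first_only[..., 0] = 1.0
y_in_range = forward_model(video, first_only)

print("all-open mask:", y_saturated[0, 0])
print("sum-to-one mask:", y_in_range[0, 0])
```

In a differentiable training pipeline the rounding step has zero gradient almost everywhere, so it would need a surrogate such as a straight-through estimator; that detail is left out of this sketch.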
