Deep Optics for Video Snapshot Compressive Imaging

Ping Wang, Lishun Wang, Xin Yuan

TL;DR

The paper narrows the gap between theory and practice in video snapshot compressive imaging (SCI) by jointly designing optics and computation. It introduces a structural mask that yields full-dynamic-range (FDR), motion-aware measurements, and a Transformer-based decoder (Res2former) tailored to long-term temporal dependencies. A sensor-response-aware forward model is included in end-to-end training so that simulation matches real hardware more closely, and the approach is demonstrated on a prototype built around a digital micro-mirror device (DMD). Key contributions are: the mask $\boldsymbol{M}_{\lambda}$, constrained by $\sum_{t=1}^B \boldsymbol{M}_{\lambda}(:,:,t) = \mathbf{1}$ to achieve FDR; a differentiable framework for optimizing the structural mask; and the Res2former decoder, whose temporal self-attention reaches competitive reconstruction quality at lower computational cost. Results on synthetic and real data show improved reconstruction quality and dynamic-range recovery, moving video SCI closer to real-world deployment.

Abstract

Video snapshot compressive imaging (SCI) aims to capture a sequence of video frames with only a single shot of a 2D detector, and rests on two components: optical modulation patterns (also known as masks) and a computational reconstruction algorithm. Advanced deep learning algorithms and mature hardware are putting video SCI into practical applications. Yet, there are two clouds in the sunshine of SCI: i) low dynamic range as a victim of high temporal multiplexing, and ii) the degradation of existing deep learning algorithms on real systems. To address these challenges, this paper presents a deep optics framework to jointly optimize masks and a reconstruction network. Specifically, we first propose a new type of structural mask to realize motion-aware and full-dynamic-range measurement. Considering the motion-awareness property in the measurement domain, we develop an efficient network for video SCI reconstruction, dubbed Res2former, which uses a Transformer to capture long-term temporal dependencies. Moreover, sensor response is introduced into the forward model of video SCI so that end-to-end model training stays close to the real system. Finally, we implement the learned structural masks on a digital micro-mirror device. Experimental results on synthetic and real data validate the effectiveness of the proposed framework. We believe this is a milestone for real-world video SCI. The source code and data are available at https://github.com/pwangcs/DeepOpticsSCI.
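The sum-to-one constraint on the structural mask can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `make_structural_mask` and `snapshot`, the random initialization, and the toy sizes are assumptions made for illustration, and the sensor response is left out entirely.

```python
import numpy as np

def make_structural_mask(height, width, num_frames, rng):
    """Hypothetical helper: non-negative mask that sums to 1 over the temporal axis."""
    raw = rng.random((height, width, num_frames)) + 1e-8
    return raw / raw.sum(axis=2, keepdims=True)

def snapshot(mask, video):
    """Temporal multiplexing: modulate every frame, then integrate over time."""
    return (mask * video).sum(axis=2)

rng = np.random.default_rng(0)
H, W, B = 4, 4, 8                      # toy spatial size and number of frames
video = rng.random((H, W, B))          # scene intensities in [0, 1]

structural = make_structural_mask(H, W, B, rng)
binary = (rng.random((H, W, B)) < 0.5).astype(float)   # random binary baseline

y_structural = snapshot(structural, video)   # convex combination of the frames
y_binary = snapshot(binary, video)           # plain sum of the selected frames

# A static scene (identical frames) is reproduced exactly by the structural mask.
frame = rng.random((H, W))
static_video = np.repeat(frame[:, :, None], B, axis=2)
y_static = snapshot(structural, static_video)
```

Because each pixel of `y_structural` is a weighted average of that pixel over time, it stays inside the intensity range of the scene, whereas a binary-mask measurement can grow up to `B` times the peak intensity before any sensor saturation is applied. The static-scene case hints at why such a measurement can be called motion-aware: only pixels that change over time deviate from the scene itself.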

Paper Structure (16 sections, 12 equations, 8 figures, 5 tables, 1 algorithm)

Figures (8)

  • Figure 1: The proposed deep optics framework brings a significant improvement for real-world video SCI, as shown by the real results (a), (b), and (c), obtained by the previous SOTA STFormer (wang2022spatial), STFormer trained under our framework, and our Res2former, respectively. (d) summarizes the comparison between Res2former and STFormer in terms of PSNR (vertical axis), FLOPs (horizontal axis), and parameters (circle radius). The proposed Res2former achieves competitive performance ($35.98$ dB) with only $28.15\%$ of the FLOPs and $56.57\%$ of the parameters of STFormer ($36.34$ dB). With its parameter count increased to STFormer's level, a large Res2former performs better ($36.56$ dB). In addition, STFormer gains $0.35$ dB when trained under our framework.
  • Figure 2: Illustration of video SCI encoder. High-speed scene is first optically modulated with temporally-varying masks and then integrated into a single digital image (i.e., snapshot measurement) through an off-the-shelf image sensor. In the process, optical modulation and sensor response are two key ingredients.
  • Figure 3: Proposed structural mask (b) vs. widely used random binary mask (c). As demonstrated in (a), structural mask values represent the transmittance of incident light, and the sum of values across the temporal dimension is $1$. This leads to the motion-aware measurement (d), which contains more visual information than the measurement (e) captured with a random binary mask.
  • Figure 4: Deep optics framework for the joint optimization of the structural mask and a deep reconstruction network. $\oplus$, $\textcircled{C}$/$\textcircled{D}$, and $\otimes$ denote element-wise addition, channel concatenation/division, and matrix multiplication, respectively. For clarity, the ResTSA module is depicted as a $3$-branch structure. In the decoder, the number of ResTSA modules can be adjusted. By default, $(N_1, N_2) = (3, 3)$.
  • Figure 5: Illustration of structural mask optimization. During forward propagation, $\boldsymbol{y} = {\cal R}[ {\cal F}({\boldsymbol{\Phi}}') \cdot \boldsymbol{x}]$. During back propagation, the derivatives of ${\cal R}$ and ${\cal F}$ are set to $1$. Noise should be incorporated into the encoder when the error caused by measurement noise and physical mask miscalibration is non-negligible.
  • ...and 3 more figures
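
The forward and backward rules in the Figure 5 caption can be sketched as follows. The caption does not say what ${\cal F}$ and ${\cal R}$ compute, so both are stand-ins here: `structure` (for ${\cal F}$) rounds the learnable parameters to a fixed number of transmittance levels, and `sensor_response` (for ${\cal R}$) saturates and quantizes to 8 bits. These choices, the squared-error loss, and all function names are assumptions for illustration only; the sketch also replaces the product ${\cal F}({\boldsymbol{\Phi}}') \cdot \boldsymbol{x}$ with per-pixel modulation followed by a sum over time.

```python
import numpy as np

def structure(phi, levels=16):
    """Hypothetical F: clip to [0, 1] and round to `levels` transmittance steps."""
    return np.round(np.clip(phi, 0.0, 1.0) * (levels - 1)) / (levels - 1)

def sensor_response(z, bits=8):
    """Hypothetical R: saturate at 1, then quantize to `bits` bits."""
    top = 2 ** bits - 1
    return np.round(np.clip(z, 0.0, 1.0) * top) / top

def forward(phi, video):
    """y = R[F(phi) . x], with '.' taken as modulate-and-integrate over time."""
    mask = structure(phi)
    return sensor_response((mask * video).sum(axis=2))

def straight_through_grad(phi, video, target):
    """Gradient of 0.5 * ||y - target||^2 w.r.t. phi, treating dR and dF as 1."""
    residual = forward(phi, video) - target    # dL/dy
    # With identity Jacobians for R and F, only the linear sum over time remains.
    return residual[:, :, None] * video

rng = np.random.default_rng(0)
H, W, B = 4, 4, 8
video = rng.random((H, W, B))
phi = np.full((H, W, B), 1.0 / B)              # start from a uniform mask
target = video.mean(axis=2)                    # arbitrary target for the demo

y = forward(phi, video)
grad = straight_through_grad(phi, video, target)
phi_updated = phi - 0.1 * grad                 # one plain gradient step
```

Setting the derivatives to 1 lets gradients pass through the rounding and clipping steps, whose true derivatives are zero almost everywhere. In a real training loop the same effect is usually obtained with an autodiff framework's stop-gradient mechanism rather than a hand-written gradient.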