Hierarchical Separable Video Transformer for Snapshot Compressive Imaging

Ping Wang, Yulun Zhang, Lishun Wang, Xin Yuan

TL;DR

This work addresses the ill-posed challenge of video Snapshot Compressive Imaging by revealing an information skewness that favors spatial cues over temporal ones. It introduces HiSViT, a Hierarchical Separable Video Transformer built from Cross-Scale Separable MSA and Gated Self-Modulated FFN, combined with a frame-wise feature extraction strategy to avoid early temporal aggregation. The proposed architecture delivers state-of-the-art reconstruction quality on grayscale, color, and real captured videos while maintaining competitive or reduced computational cost and parameters. Extensive experiments and ablations validate the effectiveness of the CSS-MSA and GSM-FFN components, and the work provides public code and pretrained models to support reproducibility and further research.

Abstract

Transformers have achieved state-of-the-art performance in solving the inverse problem of video Snapshot Compressive Imaging (SCI), whose ill-posedness is rooted in the mixed degradation of spatial masking and temporal aliasing. However, previous Transformers lack insight into this degradation and thus have limited performance and efficiency. In this work, we tailor an efficient reconstruction architecture that avoids temporal aggregation in early layers and uses the Hierarchical Separable Video Transformer (HiSViT) as its building block. HiSViT is built from multiple groups of Cross-Scale Separable Multi-head Self-Attention (CSS-MSA) and Gated Self-Modulated Feed-Forward Network (GSM-FFN) with dense connections, each of which operates on a separate channel portion at a different scale, for multi-scale interactions and long-range modeling. By separating spatial operations from temporal ones, CSS-MSA introduces an inductive bias toward paying more attention within frames than between frames while reducing computational overhead. GSM-FFN further enhances locality via a gating mechanism and factorized spatial-temporal convolutions. Extensive experiments demonstrate that our method outperforms previous methods by $>0.5$ dB with comparable or fewer parameters and complexity. The source code and pretrained models are released at https://github.com/pwangcs/HiSViT.
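
To make the "mixed degradation of spatial masking and temporal aliasing" concrete, the following is a minimal sketch of the standard video SCI measurement model: each frame is modulated element-wise by its mask and the modulated frames are summed into one snapshot. The function name `sci_measure`, the binary random masks, and the array sizes are illustrative assumptions rather than details taken from the paper, and sensor noise is omitted.

```python
import numpy as np

def sci_measure(frames, masks):
    """Simulate one snapshot: modulate each frame by its mask, then sum over time.

    frames, masks: arrays of shape (T, H, W). Noise is ignored in this sketch.
    """
    frames = np.asarray(frames, dtype=np.float64)
    masks = np.asarray(masks, dtype=np.float64)
    if frames.shape != masks.shape:
        raise ValueError("frames and masks must share the shape (T, H, W)")
    # Element-wise product = spatial masking; sum over axis 0 = temporal aliasing.
    return (frames * masks).sum(axis=0)

rng = np.random.default_rng(0)
T, H, W = 8, 4, 4                            # illustrative sizes (assumption)
frames = rng.random((T, H, W))               # stand-in video clip
masks = rng.integers(0, 2, size=(T, H, W))   # binary masks (assumption)
measurement = sci_measure(frames, masks)
print(measurement.shape)  # (4, 4)
```

Reconstruction has to recover all `T` frames from the single `H x W` measurement and the known masks, which is why the problem is ill-posed.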

Paper Structure

This paper contains 25 sections, 5 equations, 10 figures, and 5 tables.

Figures (10)

  • Figure 1: Our HiSViT achieves SOTA performance on (a) grayscale and (b) color video SCI reconstruction with comparable or fewer MACs and (c) parameters.
  • Figure 1: Computational complexity of different MSAs for an input size $T\!\times\! H\!\times\!W\!\times\!d$. $t\!\times\! h\!\times\!w$ denotes the 3D window size. $\rho$ is the spatial average pooling size.
  • Figure 2: Video SCI pipeline and its degradation. (a) involves the mixed degradation of spatial masking and temporal aliasing, caused by modulation ($\odot$) and multiplexing ($\boldsymbol{\Sigma}$). (b) is the structural similarity map between degraded frames and clear frames.
  • Figure 3: Visualization of shallow features extracted by the 3D CNN in EfficientSCI [wang2023efficientsci] and by RSTB (without temporal aggregation) in our model. Our frame-wise extraction better retrieves the temporal correlations with fewer parameters ($0.28$ vs. $1.12$ M) and MACs ($148.85$ vs. $241.79$ G).
  • Figure 4: Illustration of the proposed video SCI reconstruction architecture.
  • ...and 5 more figures
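
As a rough illustration of why separating spatial from temporal attention can save computation, the sketch below counts query-key pairs for joint attention over all $T \times H \times W$ tokens versus a generic factorized scheme (spatial attention within each frame, then temporal attention at each pixel location). This is a simplified assumption for intuition only: it ignores windows, the pooling size $\rho$, and the channel dimension $d$, and it is not the complexity formula of CSS-MSA. The function names are hypothetical.

```python
def joint_pairs(T, H, W):
    """Query-key pairs when every token attends to every token in the clip."""
    n = T * H * W
    return n * n

def separable_pairs(T, H, W):
    """Query-key pairs for spatial-then-temporal (factorized) attention."""
    spatial = T * (H * W) ** 2    # each frame attends within itself
    temporal = (H * W) * T ** 2   # each pixel location attends across frames
    return spatial + temporal

# Illustrative sizes (assumption, not taken from the paper).
T, H, W = 8, 16, 16
print(joint_pairs(T, H, W))      # 4194304
print(separable_pairs(T, H, W))  # 540672
```

In this toy count most of the separable cost is the spatial term, which matches the stated bias of attending more within frames than between frames.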