Dual-Scale Transformer for Large-Scale Single-Pixel Imaging

Gang Qu, Ping Wang, Xin Yuan

TL;DR

This work tackles the challenge of high-fidelity, large-scale single-pixel imaging by bridging real SPI hardware with advanced reconstruction models. It introduces HATNet, a deep unfolding network that unrolls tensor ISTA on the Kronecker SPI model into two modules: a tensor gradient-descent projector and a deep denoiser based on a hybrid-attention Transformer (HAT) that captures both high- and low-frequency spatial features and global channel information. By leveraging Kronecker SPI, HATNet avoids the prohibitive cost of large vectorized measurement matrices while achieving state-of-the-art reconstruction quality on synthetic and real SPI data. The approach shows strong performance on full-size images, demonstrates robustness to illumination variations, and validates its practicality with a real SPI prototype, marking a significant advance toward real-world computational imaging. The combination of tensor ISTA unfolding and dual-scale, channel-aware attention offers a scalable path for high-resolution SPI in practical settings.
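
As a rough sketch of the unrolling described above, one stage can be written as a proximal-gradient iteration. The symbols are assumptions chosen for illustration and are not taken from the paper: ${\bf X}$ is the image, ${\bf Y}$ the measurement array (noise ignored), ${\bf A}$ and ${\bf B}$ the two factor matrices of a separable (Kronecker) sensing operator, $\rho_k$ a step size, and $\mathcal{D}_k$ the stage-$k$ denoiser standing in for the proximal step of ISTA.

$$
\begin{aligned}
{\bf Y} &= {\bf A}{\bf X}{\bf B}^{\top} \iff \mathrm{vec}({\bf Y}) = ({\bf B} \otimes {\bf A})\,\mathrm{vec}({\bf X}), \\
{\bf Z}_k &= {\bf X}_{k-1} - \rho_k\,{\bf A}^{\top}\!\left({\bf A}{\bf X}_{k-1}{\bf B}^{\top} - {\bf Y}\right){\bf B}, \\
{\bf X}_k &= \mathcal{D}_k({\bf Z}_k).
\end{aligned}
$$

Under these assumptions, the second line is a gradient step on $\tfrac{1}{2}\|{\bf A}{\bf X}{\bf B}^{\top} - {\bf Y}\|_F^2$, and it only ever multiplies by the small factors. For an $H \times W$ image and an $m_1 \times m_2$ measurement array, the factors hold $m_1 H + m_2 W$ entries, whereas the explicit matrix ${\bf B} \otimes {\bf A}$ holds $m_1 m_2 H W$ entries.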

Abstract

Single-pixel imaging (SPI) is a promising computational imaging technique that produces an image by solving an ill-posed reconstruction problem from a few measurements captured by a single-pixel detector. Deep learning has achieved impressive success in SPI reconstruction. However, the poor reconstruction performance and impractical imaging models of previous methods limit their real-world application. In this paper, we propose HATNet, a deep unfolding network with a hybrid-attention Transformer built on the Kronecker SPI model, to improve the imaging quality of real SPI cameras. Specifically, we unfold the computation graph of the iterative shrinkage-thresholding algorithm (ISTA) into two alternating modules: efficient tensor gradient descent and hybrid-attention multiscale denoising. By virtue of Kronecker SPI, the gradient descent module avoids the high computational overhead of previous gradient descent modules based on vectorized SPI. The denoising module is an encoder-decoder architecture powered by dual-scale spatial attention for high- and low-frequency aggregation and channel attention for global information recalibration. Moreover, we build an SPI prototype to verify the effectiveness of the proposed method. Extensive experiments on synthetic and real data demonstrate that our method achieves state-of-the-art performance. The source code and pre-trained models are available at https://github.com/Gang-Qu/HATNet-SPI.
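
To make the contrast between Kronecker and vectorized SPI concrete, the following is a minimal sketch and not the authors' implementation. It assumes a separable measurement of the form Y = A X Bᵀ, and all function names (`kronecker_measure`, `vectorized_measure`, `tgd_step`, `data_loss`) are hypothetical. It uses plain Python lists so that it runs without dependencies; an array library would be the natural choice in practice.

```python
def matmul(P, Q):
    """Matrix product of two list-of-lists matrices."""
    return [[sum(p * q for p, q in zip(row, col)) for col in zip(*Q)] for row in P]


def transpose(P):
    """Matrix transpose."""
    return [list(col) for col in zip(*P)]


def kron(P, Q):
    """Kronecker product: block (i, j) of the result is P[i][j] * Q."""
    return [[p * q for p in prow for q in qrow] for prow in P for qrow in Q]


def vec(P):
    """Column-major vectorization (stack the columns of P)."""
    return [P[i][j] for j in range(len(P[0])) for i in range(len(P))]


def kronecker_measure(A, X, B):
    """Assumed separable model Y = A X B^T, using only the small factors."""
    return matmul(matmul(A, X), transpose(B))


def vectorized_measure(A, X, B):
    """Same measurement as y = (B kron A) vec(X), via the explicit big matrix."""
    big = kron(B, A)  # shape (m1*m2) x (H*W): this is the costly object
    x = vec(X)
    return [sum(b * v for b, v in zip(row, x)) for row in big]


def data_loss(A, X, B, Y):
    """Data-fidelity term 0.5 * ||A X B^T - Y||_F^2."""
    pred = kronecker_measure(A, X, B)
    return 0.5 * sum((p - y) ** 2 for rp, ry in zip(pred, Y) for p, y in zip(rp, ry))


def tgd_step(A, X, B, Y, rho):
    """One gradient step Z = X - rho * A^T (A X B^T - Y) B on the data term."""
    pred = kronecker_measure(A, X, B)
    resid = [[p - y for p, y in zip(rp, ry)] for rp, ry in zip(pred, Y)]
    grad = matmul(matmul(transpose(A), resid), B)
    return [[x - rho * g for x, g in zip(rx, rg)] for rx, rg in zip(X, grad)]


if __name__ == "__main__":
    # Toy sizes (hypothetical): 4 x 6 image, 2 x 3 measurement array.
    A = [[1, -1, 1, 1], [1, 1, -1, 1]]
    B = [[1, 1, 1, 1, 1, 1], [1, -1, 1, -1, 1, -1], [1, 1, -1, -1, 1, 1]]
    X = [[i + j + 1 for j in range(6)] for i in range(4)]
    Y = kronecker_measure(A, X, B)
    print("models agree:", vec(Y) == vectorized_measure(A, X, B))
    print("factor entries:", 2 * 4 + 3 * 6, "vs explicit matrix entries:", 6 * 24)
```

In a deep unfolding network, the output of `tgd_step` would be passed to a learned denoiser at every stage. The sketch covers only the gradient-descent half and leaves the denoiser out.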

Paper Structure (13 sections, 8 equations, 7 figures, 4 tables)

Figures (7)

  • Figure 1: (a) Our SPI prototype (OL: objective lens, DMD: digital micromirror device, DAQ card: data acquisition card). (b) Performance comparison of different methods on the Set11 dataset at different sampling ratios. (c) Real experimental results of different methods at a sampling ratio of 25%.
  • Figure 2: Illustration of the proposed method. (a) shows the Kronecker SPI model. As shown in (b), our deep unfolding network (DUN) reconstructs a high-fidelity image $\hat{{\bf X}}$ from the initialization input ${\bf X}_0$. It is composed of multiple stages with skip connections, and each stage involves a tensor gradient descent (TGD) operator (Eq. \ref{eq: z11}) and a U-shaped deep denoiser (Eq. \ref{eq: x11}). The deep denoiser is built from the proposed HATB blocks, each of which consists of residual dual-scale spatial-wise self-attention (S-SA), a feed-forward network (FFN), and channel-wise self-attention (C-SA). The structures of S-SA and C-SA are shown in (c) and (d), respectively.
  • Figure 3: Visualization of different methods on (a) Barbara and (b) Lena at SR = $10\%$.
  • Figure 4: Experimental results of (a) a cartoon tiger and (b) a resolution target reconstructed by different methods at SR = $25\%$.
  • Figure 5: Large-scale experimental results with $768 \!\times\! 1024$ pixels at SR = $12.5\%$.
  • ...and 2 more figures