LUM-ViT: Learnable Under-sampling Mask Vision Transformer for Bandwidth Limited Optical Signal Acquisition

Lingfeng Liu, Dong Ni, Hangjie Yuan

TL;DR

This work uses pre-acquisition modulation to reduce the acquisition volume and introduces LUM-ViT, a Vision Transformer variant with a learnable under-sampling mask tailored for that modulation.

Abstract

Bandwidth constraints during signal acquisition frequently impede real-time detection applications. Hyperspectral data is a notable example, whose vast volume compromises real-time hyperspectral detection. To tackle this hurdle, we introduce a novel approach leveraging pre-acquisition modulation to reduce the acquisition volume. This modulation process is governed by a deep learning model, utilizing prior information. Central to our approach is LUM-ViT, a Vision Transformer variant. Uniquely, LUM-ViT incorporates a learnable under-sampling mask tailored for pre-acquisition modulation. To further optimize for optical calculations, we propose a kernel-level weight binarization technique and a three-stage fine-tuning strategy. Our evaluations reveal that, by sampling a mere 10% of the original image pixels, LUM-ViT maintains the accuracy loss within 1.8% on the ImageNet classification task. The method sustains near-original accuracy when implemented on real-world optical hardware, demonstrating its practicality. Code will be available at https://github.com/MaxLLF/LUM-ViT.

Paper Structure (23 sections, 12 equations, 13 figures, 4 tables)

Figures (13)

  • Figure 1: The comprehensive structure of LUM-ViT. The method unfolds in two stages: the electronic-only training and inference phase, referred to as the Training Phase, with dataflow depicted in green, and the DMD-involved inference phase, referred to as the Real-World Application Phase, with dataflow depicted in blue. The illustration shows how the first patch is modulated by the DMD using the first kernel, a process replicable to all kernels and patches, though not depicted. The learnable under-sampling mask determines which of them are selected.
  • Figure 2: The function of the DMD acquisition system. (a) illustrates the spatial modulation principle of incoming light by the DMD. (b) outlines the modulation process. The multichannel input matrix undergoes binary masking via the DMD, following its display pattern. Lens 2 then optically sums the residual pixels.
  • Figure 3: The dataflow in LUM-ViT.
  • Figure 4: The main results in the Training Phase of LUM-ViT, providing the Top-1 Acc results on the ImageNet-1k classification task. Comparison with the basic methods shows the reliability of the learnable under-sampling mask strategy, especially at extremely low under-sampling rates (2%-5%). The dark red line marks the baseline upper bound.
  • Figure 5: Comparison of mask patch ratios. This is not a performance comparison for network lightweighting; it demonstrates the effectiveness of the learnable mask of LUM-ViT.
  • ...and 8 more figures
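
The modulation described in the Figure 2 caption (binary masking by the DMD, then optical summation of the remaining pixels) combined with an under-sampling mask can be illustrated with a small NumPy simulation. This is a minimal sketch under stated assumptions, not the authors' implementation: the function names, array shapes, the per-channel summation, and the random 10% keep mask (standing in for the learned mask) are all hypothetical choices made for illustration.

```python
import numpy as np

def dmd_measure(patch, binary_kernel):
    """Simulate one measurement: mask a patch with a 0/1 pattern, then sum.

    patch: (h, w, C) array; binary_kernel: (h, w) array of 0/1 values.
    Assumption: the same pattern is applied to every channel and the
    remaining pixels are summed per channel.
    """
    masked = patch * binary_kernel[..., None]
    return masked.sum(axis=(0, 1))

def undersampled_measurements(patches, kernels, keep_mask):
    """Acquire only the (patch, kernel) pairs selected by keep_mask.

    patches: (P, h, w, C); kernels: (K, h, w); keep_mask: (P, K) boolean.
    Returns a (P, K, C) array with zeros where no measurement was taken.
    """
    P, K, C = patches.shape[0], kernels.shape[0], patches.shape[-1]
    out = np.zeros((P, K, C))
    for p, k in zip(*np.nonzero(keep_mask)):
        out[p, k] = dmd_measure(patches[p], kernels[k])
    return out

# Illustrative usage with made-up sizes.
rng = np.random.default_rng(0)
patches = rng.random((196, 16, 16, 3))           # hypothetical 14x14 grid of patches
kernels = rng.integers(0, 2, size=(64, 16, 16))  # hypothetical binary kernels
keep_mask = rng.random((196, 64)) < 0.10         # random stand-in for the learned mask
measurements = undersampled_measurements(patches, kernels, keep_mask)
print(measurements.shape, int(keep_mask.sum()), "measurements acquired")
```

Only the selected pairs trigger a simulated acquisition, so the number of measurements scales with the fraction of entries kept in the mask rather than with the full patch-kernel grid.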