Trimming Down Large Spiking Vision Transformers via Heterogeneous Quantization Search

Boxun Xu, Yufei Song, Peng Li

TL;DR

This paper tackles the challenge of deploying large spiking vision transformers on resource-limited edge devices by introducing SpikeHQ, a heterogeneous, layer-wise quantization framework guided by neural architecture search (NAS). Each layer may adopt either uniform or power-of-two quantization at its own bit-width, and the quantized model is trained with quantization-aware training, so SpikeHQ achieves substantial compression and energy savings while maintaining near-baseline accuracy on neuromorphic vision tasks. Key contributions include a hardware-aware NAS formulation, a differentiable search strategy based on Gumbel-softmax, and a detailed hardware overhead model that informs per-layer quantization choices. The proposed method yields energy reductions of up to $10.2\times$ and storage reductions of up to $10.19\times$, with average weight precision around $2.16$–$3.24$ bits and accuracy losses typically below $1\%$, highlighting a practical path to efficient spiking transformers on edge hardware.
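The TL;DR names Gumbel-softmax as the mechanism that makes the per-layer scheme choice differentiable. A minimal sketch of that idea in plain Python might look as follows. The candidate list mirrors the options named in the Figure 5 caption; the function names, the temperature `tau`, and the way candidate outputs are mixed are illustrative assumptions, not the authors' implementation.

```python
import math
import random

# Candidate schemes per layer, as listed in the Figure 5 caption.
CANDIDATES = ["2-bit uniform", "4-bit uniform",
              "2-bit power-of-two", "4-bit power-of-two"]


def gumbel_softmax(logits, tau=1.0, rng=random):
    """Draw a relaxed one-hot sample over the candidates.

    logits: one learnable score per candidate (hypothetical parameters).
    tau:    temperature; smaller values push the sample toward one-hot.
    """
    noisy = []
    for logit in logits:
        # Gumbel(0, 1) noise is -log(-log(U)) with U ~ Uniform(0, 1).
        u = min(max(rng.random(), 1e-12), 1.0 - 1e-12)
        noisy.append((logit - math.log(-math.log(u))) / tau)
    # Numerically stable softmax over the perturbed scores.
    top = max(noisy)
    exps = [math.exp(v - top) for v in noisy]
    total = sum(exps)
    return [e / total for e in exps]


def mix_outputs(probs, candidate_outputs):
    """Weighted sum of the outputs produced under each candidate scheme."""
    return sum(p * y for p, y in zip(probs, candidate_outputs))
```

During search, the mixed output would stand in for the layer's output so that gradients can reach the scores in `logits`; after search, one plausible rule is to keep the highest-scoring candidate for each layer.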

Abstract

Spiking Neural Networks (SNNs) are amenable to deployment on edge devices and neuromorphic hardware due to their lower energy dissipation. Recently, SNN-based transformers have garnered significant interest, incorporating attention mechanisms akin to their counterparts in Artificial Neural Networks (ANNs) while demonstrating excellent performance. However, deploying large spiking transformer models on resource-constrained edge devices, such as mobile phones, still poses significant challenges resulting from the high computational demands of large, uncompressed, high-precision models. In this work, we introduce a novel heterogeneous quantization method for compressing spiking transformers through layer-wise quantization. Our approach optimizes the quantization of each layer using one of two distinct quantization schemes, i.e., uniform or power-of-two quantization, with mixed bit resolutions. Our heterogeneous quantization demonstrates the feasibility of maintaining high performance for spiking transformers while using an average effective resolution of 3.14-3.67 bits with less than a 1% accuracy drop on the DVS-Gesture and CIFAR10-DVS datasets. It attains a model compression rate of 8.71x-10.19x relative to standard floating-point spiking transformers. Moreover, the proposed approach achieves significant energy reductions of 5.69x, 8.72x, and 10.2x while maintaining high accuracy levels of 85.3%, 97.57%, and 80.4% on the N-Caltech101, DVS-Gesture, and CIFAR10-DVS datasets, respectively.
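To make the two schemes named in the abstract concrete, here is a small sketch of a symmetric uniform quantizer and a power-of-two quantizer for a single weight. The exact quantizer definitions are not given in this excerpt, so the level sets, the `scale` parameter, and the function names below are assumptions chosen for illustration. The last helper checks the abstract's arithmetic under the assumption that the compression rate is simply 32 bits divided by the average bit-width.

```python
import math


def quantize_uniform(w, bits, scale):
    """Symmetric uniform quantizer: evenly spaced levels in [-scale, scale]."""
    n = 2 ** (bits - 1) - 1            # positive levels, e.g. 7 for 4 bits
    clipped = max(-scale, min(scale, w))
    return round(clipped / scale * n) / n * scale


def quantize_power_of_two(w, bits, scale):
    """Snap |w| to the nearest of {0, scale * 2**-k}, keeping the sign.

    One bit is assumed to hold the sign; the remaining bits index zero
    or one of the power-of-two magnitudes.
    """
    n_exp = 2 ** (bits - 1) - 1        # number of nonzero magnitudes
    levels = [0.0] + [scale * 2.0 ** -k for k in range(n_exp)]
    mag = min(abs(w), scale)
    nearest = min(levels, key=lambda q: abs(q - mag))
    return math.copysign(nearest, w)


def compression_ratio(avg_bits, baseline_bits=32):
    """Storage ratio versus a baseline, assuming size scales with bit-width."""
    return baseline_bits / avg_bits
```

With `scale=1.0` and 4 bits, a weight of 0.3 maps to 2/7 (about 0.286) under the uniform quantizer and to 0.25 under the power-of-two quantizer. Power-of-two levels are attractive in hardware because multiplying by them reduces to a bit shift. Under the stated assumption, `compression_ratio(3.14)` is about 10.19 and `compression_ratio(3.67)` is about 8.72, close to the 8.71x-10.19x range quoted in the abstract.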

Paper Structure

This paper contains 14 sections, 15 equations, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Heterogeneous Quantization Compression on Spiking Vision Transformers.
  • Figure 2: SpikeHQ: proposed heterogeneous quantization by bi-level neural architecture search.
  • Figure 3: Evolution of quantization scheme selection probabilities for the last layer of spiking transformers across neuromorphic datasets and the total loss during architecture search.
  • Figure 4: Breakdown of normalized energy consumption and storage overhead of spiking transformers across different neuromorphic vision tasks. The energy consumption and storage overhead are quantified across tokenizers, self-attention layers, and MLP layers before and after applying the proposed quantization method SpikeHQ, under different hyperparameters. Red lines represent model accuracies; pie charts break the energy consumption and storage of self-attention layers into the query, key, value, and output layers when $\beta=2.0$.
  • Figure 5: Distribution of weight parameter values before (top figures) and after (bottom figures) applying SpikeHQ. Median values are highlighted by blue lines, with boxes delineating the 25th and 75th percentile ranges. The external black lines represent the 5th and 95th percentiles, while outliers beyond these markers are denoted with black dots. After applying SpikeHQ, the value range becomes more compact across different datasets. Red dots represent the layer-wise quantization choices: '2l' and '4l' for 2-bit and 4-bit power-of-two; '2u' and '4u' for 2-bit and 4-bit uniform; 'fp32' for 32-bit floating-point.
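The layer-wise labels in the Figure 5 caption ('2l', '4l', '2u', '4u', 'fp32') can be turned into an average bit-width with a few lines of Python. This is a sketch only: it assumes the average is weighted by each layer's parameter count, which this excerpt does not confirm, and the layer sizes in the usage note are hypothetical rather than taken from the paper.

```python
def bits_of(label):
    """Bit-width encoded in a Figure 5 style label such as '2l' or 'fp32'."""
    if label == "fp32":
        return 32
    return int(label[:-1])             # '4u' -> 4, '2l' -> 2


def scheme_of(label):
    """Quantization scheme encoded in the label's suffix."""
    if label == "fp32":
        return "floating-point"
    return {"l": "power-of-two", "u": "uniform"}[label[-1]]


def average_bits(choices, param_counts):
    """Parameter-weighted mean bit-width over layers (assumed definition)."""
    total = sum(param_counts)
    weighted = sum(bits_of(c) * n for c, n in zip(choices, param_counts))
    return weighted / total
```

For example, with hypothetical layers of 1000, 1000, 2000, and 4000 parameters assigned '2l', '4u', '4l', and '2u', `average_bits` returns 2.75.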