SpikeZIP-TF: Conversion is All You Need for Transformer-based SNN

Kang You, Zekai Xu, Chen Nie, Zhijie Deng, Qinghai Guo, Xiang Wang, Zhezhi He

TL;DR

Transformer-based SNNs obtained by direct conversion have lagged behind their ANN counterparts in accuracy. SpikeZIP-TF achieves near-lossless ANN-to-SNN conversion by introducing spike-equivalent operators (SESA, Spike-Softmax, and Spike-LayerNorm) built around an ST-BIF+ neuron, so that the SNN stays equivalent to the quantized-activation ANN. The approach delivers state-of-the-art results on ImageNet ($83.82\%$ top-1) and SST-2 ($93.79\%$), with ultra-low latency (as few as $8$ time-steps) and favorable power-accuracy trade-offs, while reducing training cost by reusing pre-trained ANNs. This enables efficient neuromorphic deployment of Transformer models across computer vision and natural language tasks.

Abstract

Spiking neural networks (SNNs) have attracted great attention for their high efficiency and accuracy. Current ANN-to-SNN conversion methods can produce CNN-based SNNs that match ANN accuracy at ultra-low latency (8 time-steps) on computer vision (CV) tasks. However, although Transformer-based networks achieve leading accuracy on both CV and natural language processing (NLP) tasks, Transformer-based SNNs still suffer from lower accuracy than their ANN counterparts. In this work, we introduce a novel ANN-to-SNN conversion method called SpikeZIP-TF, in which the ANN and the SNN are exactly equivalent, so conversion incurs no accuracy degradation. SpikeZIP-TF achieves 83.82% accuracy on a CV dataset (ImageNet) and 93.79% accuracy on an NLP dataset (SST-2), both higher than SOTA Transformer-based SNNs. The code is available on GitHub: https://github.com/Intelligent-Computing-Research-Group/SpikeZIP_transformer

Paper Structure

This paper contains 49 sections, 3 theorems, 34 equations, 7 figures, and 20 tables.

Key Result

Lemma A3.1

After entering the equilibrium state at $T_{\rm eq}$, the accumulated output spikes of one ST-BIF neuron can be written in closed form as a quantization function of $V^\textrm{in}$ and $V_{t=0}$, where $V^\textrm{in} = \sum_{t=0}^{T_{\rm eq}} V^\textrm{in}_t$ is the input accumulated up to $T_{\rm eq}$ and $V_{t=0}$ denotes the initial membrane potential.
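The closed-form expression itself is not reproduced in this summary, so the following is only a minimal sketch of the kind of equivalence the lemma describes. It assumes a simple bipolar integrate-and-fire neuron with a spike tracer and compares its accumulated output against a clip-floor quantizer. The function names, the update rule, the threshold `thr`, the initial potential `v0`, and the tracer bounds `s_min` and `s_max` are all illustrative assumptions, not the paper's definitions.

```python
import numpy as np

def st_bif_accumulate(inputs, thr=1.0, v0=0.5, s_min=0, s_max=7):
    """Simulate one bipolar IF neuron with a spike tracer (assumed dynamics).

    Returns the accumulated output spikes after the neuron has settled.
    """
    v, s = v0, 0                      # membrane potential, spike tracer
    settle = s_max - s_min + 1        # zero-input steps to reach equilibrium
    for x in list(inputs) + [0.0] * settle:
        v += x
        if v >= thr and s < s_max:    # positive spike
            v -= thr
            s += 1
        elif v < 0 and s > s_min:     # negative (corrective) spike
            v += thr
            s -= 1
    return s

def quantize(v_in, thr=1.0, v0=0.5, s_min=0, s_max=7):
    """Assumed clip-floor quantizer applied to the accumulated input."""
    return int(np.clip(np.floor((v_in + v0) / thr), s_min, s_max))

# Inputs are multiples of 1/8 so every sum is exact in floating point.
rng = np.random.default_rng(0)
for _ in range(1000):
    xs = rng.integers(-16, 25, size=8) / 8.0
    assert st_bif_accumulate(xs) == quantize(xs.sum())
```

Under these assumptions the match follows from the invariant that the membrane potential plus `thr` times the tracer always equals `v0` plus the input seen so far. Once no further spike can fire, the tracer is therefore the clipped floor of that total divided by `thr`.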

Figures (7)

  • Figure 1: Comparison of Transformer-based SNNs. Circle, star, and triangle markers denote the direct learning (DT) method, the ANN-to-SNN (A2S) conversion method, and methods that use both DT and A2S, respectively; the area of each marker corresponds to the model size. The results show that the SNN generated by SpikeZIP-TF achieves higher accuracy, at a larger model size, than other recent SNNs. The largest SpikeZIP-TF model on ImageNet is 304.33 MB.
  • Figure 2: The conversion pipeline of SpikeZIP-TF.
  • Figure 3: Architecture of the Transformer-based SNN in SpikeZIP-TF. Compared to the vanilla Transformer, SpikeZIP-TF inserts the ST-BIF+ neuron before and after the matrix multiplication operations and replaces SNN-unfriendly operators (dot product, Softmax and LayerNorm) with SNN-friendly ones (spiking dot product, Spike-Softmax and Spike-LayerNorm). TF: Transformer; $n$: sequence length; $d$: token dimension; $\{{\bm{Q}},{\bm{K}},{\bm{V}},{\bm{A}}\}$, $\{{\bm{Q}}_{\textrm{q}},{\bm{K}}_{\textrm{q}},{\bm{V}}_{\textrm{q}},{\bm{A}}_{\textrm{q}}\}$, $\{{\bm{Q}}_{\textrm{s}},{\bm{K}}_{\textrm{s}},{\bm{V}}_{\textrm{s}},{\bm{A}}_{\textrm{s}}\}$: the query, key, value and attention arrays, their quantized forms, and their spike forms, respectively.
  • Figure 4: The process of matrix multiplication in SESA. The bracketed parts in (a) and (b) correspond to \ref{eqt:multi1} and \ref{eqt:multi2}, respectively. (a) ${\bm{X}}_{\textrm{s},t}, {\bm{O}}_{\textrm{s},t}$ represent the input and output spike trains in the SNN at time-step $t$. (b) ${\bm{Q}}_{\textrm{s},t}, {\bm{K}}_{\textrm{s},t}$ denote the query and key in the SNN at time-step $t$; ${\bm{S}}_{\textrm{Q},t}, {\bm{S}}_{\textrm{K},t}$ are the spike tracers in the neuron layers, which store the accumulated output for the query and key. At each time-step, the accumulated output in the spike tracers is used to perform the activation-activation (AA) multiplication via three matrix multiplications.
  • Figure 5: Curves of accuracy versus time-step under different settings. (a) SpikeZIP-TF uses the RoBERTa architecture (QANN is quantized with 32 levels); (b) SpikeZIP-TF uses ViT-small (QANN is quantized with 16 levels); (c) SpikeZIP-TF uses ViT small/base/large as the architecture on ImageNet; (d) the architecture is ViT-B on ImageNet, where the QANNs (\ref{fig:spikezip_arch}) are quantized with different levels.
  • ...and 2 more figures
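
The incremental query-key product described in the Figure 4 caption can be illustrated with a short NumPy sketch. The equations the caption references are not reproduced in this summary, so the grouping into three matrix multiplications below is an assumption derived from expanding $({\bm{S}}_{\textrm{Q}} + {\bm{Q}}_{\textrm{s},t})({\bm{S}}_{\textrm{K}} + {\bm{K}}_{\textrm{s},t})^\top - {\bm{S}}_{\textrm{Q}}{\bm{S}}_{\textrm{K}}^\top$; the paper's exact arrangement may differ. The function name, the ternary spike values, and the omission of quantization scales are likewise illustrative choices.

```python
import numpy as np

def incremental_attention_scores(Q_s, K_s):
    """Accumulate Q K^T over time-steps using spike tracers.

    Q_s, K_s: integer spike trains of shape (T, n, d).
    Each step adds the change in (accumulated Q) @ (accumulated K)^T,
    computed with three matrix multiplications.
    """
    S_Q = np.zeros_like(Q_s[0])   # tracer: query spikes accumulated so far
    S_K = np.zeros_like(K_s[0])   # tracer: key spikes accumulated so far
    total = np.zeros((Q_s.shape[1], K_s.shape[1]), dtype=Q_s.dtype)
    for q_t, k_t in zip(Q_s, K_s):
        # (S_Q + q_t)(S_K + k_t)^T - S_Q S_K^T, expanded into three products
        delta = q_t @ S_K.T + S_Q @ k_t.T + q_t @ k_t.T
        total += delta
        S_Q += q_t
        S_K += k_t
    return total

rng = np.random.default_rng(0)
T, n, d = 6, 4, 8
Q_s = rng.integers(-1, 2, size=(T, n, d)).astype(np.int64)  # spikes in {-1, 0, 1}
K_s = rng.integers(-1, 2, size=(T, n, d)).astype(np.int64)

# The per-step increments telescope to the product of the accumulated arrays.
reference = Q_s.sum(axis=0) @ K_s.sum(axis=0).T
assert np.array_equal(incremental_attention_scores(Q_s, K_s), reference)
```

Because the per-step increments telescope, the sum of the outputs over all time-steps equals the product of the accumulated query and key, which is the kind of equivalence a spike-based attention product needs to preserve.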

Theorems & Definitions (6)

  • Lemma A3.1
  • proof
  • Lemma A3.2
  • proof
  • Lemma A3.3
  • proof