Scaling Spike-driven Transformer with Efficient Spike Firing Approximation Training

Man Yao, Xuerui Qiu, Tianxiang Hu, Jiakui Hu, Yuhong Chou, Keyu Tian, Jianxing Liao, Luziwei Leng, Bo Xu, Guoqi Li

TL;DR

This work addresses the performance and training cost gap in Spiking Neural Networks by identifying binary spike firing as a mechanistic flaw and introducing Spike Firing Approximation (SFA), which trains with integer activations and performs spike-driven inference. It presents an efficient spike-driven Transformer backbone (E-SpikeFormer) and a spike-masked autoencoder with Spike Sparse Convolution to enable scalable SNNs. Across ImageNet-1k, COCO, ADE20K, and HAR-DVS, the approach achieves state-of-the-art results among SNNs and competitive performance versus ANNs while delivering substantial training time and energy savings. The work demonstrates that SNNs can match ANN performance with low power, paving the way for SNNs as general visual backbones in neuromorphic hardware.

Abstract

The ambition of brain-inspired Spiking Neural Networks (SNNs) is to become a low-power alternative to traditional Artificial Neural Networks (ANNs). This work addresses two major challenges in realizing this vision: the performance gap between SNNs and ANNs, and the high training costs of SNNs. We identify intrinsic flaws in spiking neurons caused by binary firing mechanisms and propose a Spike Firing Approximation (SFA) method using integer training and spike-driven inference. SFA optimizes the spike firing pattern of spiking neurons, which makes training more efficient, reduces power consumption, improves performance, eases scaling, and makes better use of neuromorphic chips. We also develop an efficient spike-driven Transformer architecture and a spike-masked autoencoder to prevent performance degradation during SNN scaling. On ImageNet-1k, we achieve state-of-the-art top-1 accuracy of 78.5%, 79.8%, 84.0%, and 86.2% with models containing 10M, 19M, 83M, and 173M parameters, respectively. For instance, the 10M model outperforms the best existing SNN by 7.2% on ImageNet, with training time acceleration and inference energy efficiency improved by 4.5$\times$ and 3.9$\times$, respectively. We validate the effectiveness and efficiency of the proposed method across various tasks, including object detection, semantic segmentation, and neuromorphic vision tasks. This work enables SNNs to match ANN performance while maintaining the low-power advantage, marking a significant step towards SNNs as a general visual backbone. Code is available at https://github.com/BICLab/Spike-Driven-Transformer-V3.

Paper Structure

This paper contains 26 sections, 1 theorem, 24 equations, 12 figures, 5 tables.

Key Result

Proposition 1

Consider the $\text{Fire}_D(\cdot)$ function at the $l$-th layer of an SNN. Its integer-valued output equals the sum of the spikes generated by an IF-SR spiking neuron over $D$ timesteps:

$$\mathbf{S}^l_D = \sum_{d=1}^{D} \hat{\mathbf{S}}^l[d],$$

where $\mathbf{S}^l_D$ is the integer value fired by $\text{Fire}_D(\cdot)$ at the $l$-th layer and $\{\hat{\mathbf{S}}^l[d]\}_D$ is the spike train generated by the IF-SR spiking neuron over the given $D$ timesteps with $V_{th}=1$.
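The statement of Proposition 1 can be checked numerically with a minimal sketch. Everything below is an assumption made for illustration and is not taken from the paper's code: `fire_d` is modeled as round-half-up followed by clipping to $[0, D]$, the IF-SR neuron is modeled as an integrate-and-fire neuron with soft reset that receives the whole input at the first timestep, and the membrane is pre-charged with $V_{th}/2$ so that thresholding reproduces the rounding. The function names `fire_d` and `if_sr_spike_train` are hypothetical.

```python
import numpy as np


def fire_d(x, D):
    """Integer activation (assumed form): round half up, then clip to [0, D]."""
    return np.clip(np.floor(np.asarray(x, dtype=np.float64) + 0.5), 0, D).astype(np.int64)


def if_sr_spike_train(x, D, v_th=1.0):
    """Spike train of an integrate-and-fire neuron with soft reset over D timesteps.

    Assumptions: the input x is injected once, before the first timestep,
    and the membrane starts at 0.5 * v_th (to mimic rounding).
    Returns an array of shape (D, *x.shape) with entries in {0, 1}.
    """
    v = np.asarray(x, dtype=np.float64) + 0.5 * v_th
    spikes = np.zeros((D,) + v.shape, dtype=np.int64)
    for d in range(D):
        s = (v >= v_th).astype(np.int64)  # fire when the membrane reaches the threshold
        spikes[d] = s
        v = v - s * v_th                  # soft reset: subtract the threshold
    return spikes


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    x = rng.uniform(-3.0, 8.0, size=(4, 5))
    D = 4
    integer_output = fire_d(x, D)
    spike_sum = if_sr_spike_train(x, D).sum(axis=0)
    print(np.array_equal(integer_output, spike_sum))
```

Under these assumptions the spikes of each neuron occupy the earliest timesteps and the remaining timesteps stay silent, which is the firing pattern the caption of Figure 3 attributes to SFA.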

Figures (12)

  • Figure 1: E-SpikeFormer versus other spiking Transformers on ImageNet-1k at $224^2$ input spatial resolution.
  • Figure 2: The impact of the reset mechanism on the spatio-temporal dynamics of spiking neurons. (a) The $(t-1)$-th and $(t+2)$-th inputs are greater than the threshold; the $t$-th and $(t+1)$-th inputs are not. Here we set $\beta = 1$. (b) Without a reset, the spiking neuron keeps firing once the input has exceeded the threshold at some point. (c) With a hard reset to zero, a spike is fired whenever the spatial input at the current timestep exceeds the threshold. Inputs that do not exceed the threshold will not trigger a spike. (d) With a soft reset, the membrane potential at the $t$-th timestep contains part of the information of the $(t-1)$-th timestep; even if the input at the $t$-th timestep is small, a spike may be fired.
  • Figure 3: Spike firing patterns. Suppose the value to be approximated is 0.3 ($a^l_D=0.3$), i.e., three spikes need to be fired over ten timesteps ($D=10$). For ANN2SNN and directly trained SNNs, the timesteps at which the three spikes appear are synchronized and randomized, where synchronization means that all ten timesteps must be completed to compute the spike firing rate. In contrast, SFA training fires the spikes on the first three timesteps and fires none on the remaining seven. Therefore, the spike firing in SFA can be realized in an asynchronous manner (detailed in Section \ref{sec_hardware_analysis}).
  • Figure 4: The overview of E-SpikeFormer. In general, we follow the design of Meta-SpikeFormer [meta_spikeformer] on a macro level, such as the CNN-based and Transformer-based SNN blocks and their proportions at each stage. We redesign the interior of the CNN-based and Transformer-based SNN blocks with the goal of an efficient architecture. In this figure, all upgraded designs are marked in green or pink. In the green part, we insert a BN layer and a spiking neuron layer after the DWConv layer in the CNN-based block to avoid the increase in power caused by the larger convolution kernel size after re-parameterization. In the pink part, the Re-parameterization Convolution (RepConv) of Meta-SpikeFormer is replaced by a linear layer to reduce power, and we compensate for the potential performance loss by expanding the channel number of $V_{S}$. In addition, we add a Spike Separable Convolution (SpikeSepConv) module to the Transformer-based SNN block of Meta-SpikeFormer, which helps performance.
  • Figure 5: The overview of MIM pre-training in E-SpikeFormer. It consists of an SNN encoder and an ANN decoder. The encoder processes only the visible pixels. The decoder reconstructs the images to learn representations and is removed in the fine-tuning stage. Using Vanilla Spike Convolution (VSC) as an encoder leads to information leakage, so we designed Spike Sparse Convolution (SSC) to avoid this flaw. White and black in the figure represent the one and zero areas, respectively. Exploiting VSC will cause the black zero area to continue to grow as the depth increases until there is no valid information. We show that SSC is a natural fit for neuromorphic chips due to the spike-driven nature of SNNs (Section \ref{subsec_scale_2}).
  • ...and 7 more figures
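
The three reset behaviors described in the caption of Figure 2 can be reproduced with a small simulation. This is a hedged sketch and not the authors' implementation: the function name `simulate`, the update order (charge, fire, reset), the firing condition `u >= v_th`, and the example input values are all assumptions chosen for illustration, with $\beta = 1$ and a threshold of 1 as in the caption.

```python
def simulate(inputs, reset, v_th=1.0, beta=1.0):
    """Simulate one spiking neuron and return its spike train as a list of 0/1.

    reset: "none" (membrane is never reset), "hard" (reset to zero after a
    spike), or "soft" (subtract the threshold after a spike).
    """
    if reset not in ("none", "hard", "soft"):
        raise ValueError(f"unknown reset mode: {reset}")
    v = 0.0
    spikes = []
    for x in inputs:
        u = beta * v + x             # charge: decayed membrane plus current input
        s = 1 if u >= v_th else 0    # fire when the threshold is reached
        spikes.append(s)
        if reset == "none":
            v = u
        elif reset == "hard":
            v = 0.0 if s else u
        else:                        # soft reset keeps the residual above threshold
            v = u - s * v_th
    return spikes


if __name__ == "__main__":
    # Inputs at timesteps t-1, t, t+1, t+2: only the first and last exceed the threshold.
    inputs = [1.75, 0.25, 0.25, 1.25]
    for mode in ("none", "hard", "soft"):
        print(mode, simulate(inputs, mode))
```

With these example inputs, the neuron without reset fires at every timestep, the hard-reset neuron fires only at the two supra-threshold inputs, and the soft-reset neuron also fires at timestep $t$ because the residual membrane potential from timestep $t-1$ carries over.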

Theorems & Definitions (4)

  • Proposition 1
  • proof
  • Definition 1: Forward Approximation Error
  • Definition 2: Backward Gradient Error