Spikformer V2: Join the High Accuracy Club on ImageNet with an SNN Ticket

Zhaokun Zhou; Kaiwei Che; Wei Fang; Keyu Tian; Yuesheng Zhu; Shuicheng Yan; Yonghong Tian; Li Yuan

TL;DR

Spikformer V2 demonstrates that integrating Spiking Self-Attention with a Spiking Convolutional Stem and self-supervised pre-training yields ImageNet accuracies surpassing 80% with low latency, marking a milestone for high-performance, energy-efficient SNNs. By replacing softmax-based attention with spike-driven SSA and introducing a deeper, convolutional stem, the model achieves competitive performance while dramatically reducing computation. The MAE-inspired SSL pre-training enables training larger Spikformer V2 architectures, delivering notable accuracy gains (up to 81.10% with 1 time step) and faster training, with strong results on CIFAR and neuromorphic datasets as well. Overall, the work advances SNN-based vision by combining biologically plausible processing, efficient spike-based attention, and scalable self-supervision to approach ANN performance with far lower energy demands.

Abstract

Spiking Neural Networks (SNNs), known for their biologically plausible architecture, face the challenge of limited performance. The self-attention mechanism, which is the cornerstone of the high-performance Transformer and is itself biologically inspired, is absent from existing SNNs. To this end, we explore the potential of combining self-attention with the biological properties of SNNs, and propose a novel Spiking Self-Attention (SSA) mechanism and Spiking Transformer (Spikformer). SSA eliminates the need for softmax and captures sparse visual features using spike-form Query, Key, and Value. This sparse, multiplication-free computation makes SSA efficient and energy-saving. Further, we develop a Spiking Convolutional Stem (SCS) with supplementary convolutional layers to enhance the architecture of Spikformer. The Spikformer enhanced with the SCS is referred to as Spikformer V2. To train larger and deeper Spikformer V2 models, we introduce a pioneering exploration of Self-Supervised Learning (SSL) within SNNs. Specifically, we pre-train Spikformer V2 in a masking-and-reconstruction style inspired by mainstream self-supervised Transformers, and then fine-tune it for image classification on ImageNet. Extensive experiments show that Spikformer V2 outperforms previous surrogate-training and ANN2SNN methods. An 8-layer Spikformer V2 achieves an accuracy of 80.38% using 4 time steps, and after SSL, a 172M 16-layer Spikformer V2 reaches an accuracy of 81.10% with just 1 time step. To the best of our knowledge, this is the first time that an SNN achieves 80+% accuracy on ImageNet. The code will be available at Spikformer V2.
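The softmax-free attention described in the abstract can be illustrated with a minimal sketch in plain Python. It assumes binary (0/1) Query, Key, and Value matrices and a constant scaling factor; the helper names, the firing rate, and the `scale` value are hypothetical, and the spiking neuron and normalization layers of the actual model are omitted. The sketch only shows two properties: the attention output stays non-negative without softmax, and the product can be evaluated in either order.

```python
import random

def matmul(a, b):
    """Plain matrix product of two nested lists."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]

def transpose(m):
    """Transpose a nested-list matrix."""
    return [list(col) for col in zip(*m)]

def random_spikes(rows, cols, rate, rng):
    """Binary spike matrix; each entry is 1 with probability `rate` (illustrative)."""
    return [[1 if rng.random() < rate else 0 for _ in range(cols)] for _ in range(rows)]

def ssa_qk_first(q, k, v, scale):
    """Softmax-free attention computed as (Q K^T) V * scale."""
    attn = matmul(q, transpose(k))          # N x N map, entries are spike counts >= 0
    return [[scale * x for x in row] for row in matmul(attn, v)]

def ssa_kv_first(q, k, v, scale):
    """Same quantity computed as Q (K^T V) * scale."""
    kv = matmul(transpose(k), v)            # d x d intermediate
    return [[scale * x for x in row] for row in matmul(q, kv)]

rng = random.Random(0)
N, d = 6, 4                                 # N patches, d features per head (toy sizes)
Q = random_spikes(N, d, 0.3, rng)
K = random_spikes(N, d, 0.3, rng)
V = random_spikes(N, d, 0.3, rng)

out_a = ssa_qk_first(Q, K, V, scale=0.125)  # scale value is an assumption
out_b = ssa_kv_first(Q, K, V, scale=0.125)
print(out_a == out_b)
```

Because every entry of Q, K, and V is 0 or 1, each inner product reduces to counting coincident spikes, which is why the computation can be carried out with additions alone.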

Paper Structure (24 sections, 13 equations, 8 figures, 7 tables)

This paper contains 24 sections, 13 equations, 8 figures, 7 tables.

Introduction
Background and Related Work
Spikformer
Overall Architecture
Spiking Patch Splitting (SPS)
Spiking Self Attention Mechanism (SSA)
Spikformer V2
Spiking Convolutional Stem (SCS)
Self-supervised Pre-training
Experiments
Supervised learning experiments on ImageNet
Experimental Settings.
Results.
Self-Supervised Learning on ImageNet.
Experimental Settings.
...and 9 more sections

Figures (8)

Figure 1: ImageNet-1K classification results for $\bullet$ previous SNNs, $\bullet$ Spikformer V1, $\bullet$ Spikformer V2, and $\bullet$ Vision Transformers. The diameter of each bubble is logarithmically proportional to the theoretical energy consumption. Spikformer V2 reaches classification accuracy on par with ANN Transformers while maintaining lower theoretical energy consumption.
Figure 2: Comparison between Vanilla Self-Attention (VSA) and our Spiking Self-Attention (SSA). A red spike indicates a value of 1 at a specific location. The blue dashed boxes show examples of matrix dot-product operations. For simplicity, we show one head of SSA, where $N$ is the number of input patches and $d$ is the feature dimension of one head. FLOPs stands for floating-point operations, and SOPs for theoretical synaptic operations. The theoretical energy consumption for performing one calculation between Query, Key, and Value in one time step is derived from the 8-encoder-block, 512-embedding-dimension Spikformer on the ImageNet test set, using the method in [kundu2021hire, hu2018residual, zhou2023spikformer]. (a) In VSA, $Q_{\mathcal{F}},K_{\mathcal{F}},V_{\mathcal{F}}$ are in floating-point form. After the dot product of $Q_{\mathcal{F}}$ and $K_{\mathcal{F}}$, the softmax function regularizes the attention map values to be positive. (b) In SSA, all values in the attention map are non-negative and the computation is sparse using spike-form $Q, K, V$ ($5.5\times 10^6$ SOPs vs. $77 \times 10^6$ FLOPs in VSA). As a result, SSA consumes less energy than VSA ($354.2\,\mu\mathrm{J}$). SSA is decomposable: the calculation order of $Q$, $K$, and $V$ can be changed.
Figure 3: The architecture of Spiking Transformer (Spikformer) includes a spiking patch splitting module (SPS), a Spikformer encoder, and a Linear classification head. We observe that layer normalization (LN) is not suitable for SNNs, hence we empirically utilize batch normalization (BN) instead.
Figure 4: Redesign of the patch-splitting module. SPS denotes the ImageNet accuracy achieved using Spikformer with SPS. When we remove max-pooling, the accuracy of SCS with Linear layers drops significantly (the second row), whereas SCS with Convolutional layers performs better and peaks with the two-layer convolution configuration. All models have approximately the same number of parameters. GSOPs represents the theoretical synaptic operations.
Figure 5: Comparison between the Spiking Patch Splitting (SPS) and the Spiking Convolutional Stem (SCS). (a) In each block of the SPS module, the initial operation involves applying a 2D convolution with a kernel size of 3 and a stride of 1. This is followed by a max-pooling operation for downsampling by a factor of 2, which may lead to a potential loss of feature information. (b) In each block of the SCS module, the initial step involves downsampling using a 2D convolution with a kernel size of 2 and a stride of 2. Additionally, we augment each SCS block with a convolutional block structure similar to the MLP Block.
...and 3 more figures
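
The two downsampling routes contrasted in the Figure 5 caption can be checked with a small shape calculation, sketched below in Python. The kernel sizes and strides of the convolutions come from the caption; the padding of 1 for the SPS convolution, the 2x2 stride-2 max-pooling window, and the 224-pixel input side are assumptions made only for this illustration, and the function names are hypothetical.

```python
def conv_out(size, kernel, stride, padding=0):
    """Spatial output size of a convolution or pooling window along one axis."""
    return (size + 2 * padding - kernel) // stride + 1

def sps_block(size):
    """SPS block: 3x3 stride-1 convolution, then max-pooling that halves the size."""
    size = conv_out(size, kernel=3, stride=1, padding=1)   # padding=1 is assumed
    return conv_out(size, kernel=2, stride=2)              # pooling window is assumed

def scs_block(size):
    """SCS block: a single 2x2 stride-2 convolution performs the downsampling."""
    return conv_out(size, kernel=2, stride=2)

side = 224                      # assumed input side length
sps_size = sps_block(side)
scs_size = scs_block(side)
print(sps_size, scs_size)
```

Under these assumptions both blocks halve the spatial resolution; the difference is that SCS does so with a learned strided convolution instead of a pooling step that discards activations.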
