SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks

Xinyu Shi, Zecheng Hao, Zhaofei Yu

TL;DR

This work proposes a novel spiking self-attention mechanism named Dual Spike Self-Attention (DSSA) and, based on it, SpikingResformer, an architecture that combines the ResNet-based multi-stage design with DSSA to improve both performance and energy efficiency while reducing parameters.

Abstract

The remarkable success of Vision Transformers in Artificial Neural Networks (ANNs) has led to a growing interest in incorporating the self-attention mechanism and transformer-based architecture into Spiking Neural Networks (SNNs). While existing methods propose spiking self-attention mechanisms that are compatible with SNNs, they lack reasonable scaling methods, and the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting local features. To address these challenges, we propose a novel spiking self-attention mechanism named Dual Spike Self-Attention (DSSA) with a reasonable scaling method. Based on DSSA, we propose a novel spiking Vision Transformer architecture called SpikingResformer, which combines the ResNet-based multi-stage architecture with our proposed DSSA to improve both performance and energy efficiency while reducing parameters. Experimental results show that SpikingResformer achieves higher accuracy with fewer parameters and lower energy consumption than other spiking Vision Transformer counterparts. Notably, our SpikingResformer-L achieves 79.40% top-1 accuracy on ImageNet with 4 time-steps, which is the state-of-the-art result in the SNN field.

Paper Structure (24 sections, 3 theorems, 35 equations, 3 figures, 8 tables)

Key Result

Theorem 1

Given spike input ${\bf X}\in\{0,1\}^{T\times p\times m}$, ${\bf Y}\in\{0,1\}^{T\times m\times q}$ and linear transformation $f(\cdot)$ with weight matrix ${\bf W}\in \mathbb{R}^{q\times q}$, ${\bf I}\in \mathbb{R}^{T\times p\times q}$ is the output of DST, ${\bf I}={\rm DST}({\bf X},{\bf Y};f(\cdot

Figures (3)

  • Figure 1: Comparison of Top-1 accuracy on ImageNet with respect to energy consumption per image for inference (left) and the number of parameters (right). The input size is 224$\times$224.
  • Figure 2: Left: Architecture of SpikingResformer and components including Dual Spike Self-Attention (DSSA), Multi-Head DSSA (MHDSSA), and Group-Wise Spiking Feed-Forward Network (GWSFFN). Right: Architecture of SEW ResNet-50 and Spikformer.
  • Figure S1: Diagram of the equivalence of convolution to linear transformation. Top: ${\rm Conv_p}(\cdot)$ on a 4$\times$4 input where $p=2$; Bottom: Its equivalent linear transformation.
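The equivalence described in the Figure S1 caption can be illustrated with a minimal sketch. It assumes that ${\rm Conv_p}(\cdot)$ denotes a non-overlapping $p\times p$ convolution with stride $p$, and it uses a single input and output channel with no bias; the function names (`conv_p`, `patches_as_rows`, `linear`), the input, and the kernel values are hypothetical and chosen only for illustration, not taken from the paper.

```python
def conv_p(x, kernel, p):
    """Assumed Conv_p: p x p convolution with stride p (single channel, no bias)."""
    n = len(x)
    out = []
    for i in range(0, n, p):
        row = []
        for j in range(0, n, p):
            row.append(sum(x[i + a][j + b] * kernel[a][b]
                           for a in range(p) for b in range(p)))
        out.append(row)
    return out


def patches_as_rows(x, p):
    """Flatten every non-overlapping p x p patch into one row (row-major patch order)."""
    n = len(x)
    return [[x[i + a][j + b] for a in range(p) for b in range(p)]
            for i in range(0, n, p) for j in range(0, n, p)]


def linear(rows, weight):
    """Apply the same weight vector (the flattened kernel) to every flattened patch."""
    return [sum(r * w for r, w in zip(row, weight)) for row in rows]


p = 2
# Hypothetical binary 4x4 input, standing in for a spike map.
x = [[1, 0, 1, 1],
     [0, 1, 0, 0],
     [1, 1, 0, 1],
     [0, 0, 1, 0]]
# Hypothetical 2x2 kernel.
kernel = [[0.5, -1.0],
          [2.0, 0.25]]

conv_out = conv_p(x, kernel, p)
flat_kernel = [kernel[a][b] for a in range(p) for b in range(p)]
linear_out = linear(patches_as_rows(x, p), flat_kernel)

# Flatten the convolution output in the same patch order and compare.
conv_flat = [v for row in conv_out for v in row]
print(conv_flat == linear_out)
```

Under these assumptions each output position of the convolution depends on exactly one patch, so stacking the flattened patches as rows turns the convolution into a single matrix-vector product with the flattened kernel.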

Theorems & Definitions (7)

  • Theorem 1: Mean and variance of DST
  • Definition 1
  • Lemma 1
  • proof
  • Lemma 2
  • proof
  • proof