SpikingResformer: Bridging ResNet and Vision Transformer in Spiking Neural Networks

Xinyu Shi; Zecheng Hao; Zhaofei Yu

TL;DR

This work proposes Dual Spike Self-Attention (DSSA), a spiking self-attention mechanism with a reasonable scaling method, and SpikingResformer, an architecture that combines the ResNet-based multi-stage design with DSSA to improve both performance and energy efficiency while reducing parameters.

Abstract

The remarkable success of Vision Transformers in Artificial Neural Networks (ANNs) has led to a growing interest in incorporating the self-attention mechanism and transformer-based architecture into Spiking Neural Networks (SNNs). While existing methods propose spiking self-attention mechanisms that are compatible with SNNs, they lack reasonable scaling methods, and the overall architectures proposed by these methods suffer from a bottleneck in effectively extracting local features. To address these challenges, we propose a novel spiking self-attention mechanism named Dual Spike Self-Attention (DSSA) with a reasonable scaling method. Based on DSSA, we propose a novel spiking Vision Transformer architecture called SpikingResformer, which combines the ResNet-based multi-stage architecture with our proposed DSSA to improve both performance and energy efficiency while reducing parameters. Experimental results show that SpikingResformer achieves higher accuracy with fewer parameters and lower energy consumption than other spiking Vision Transformer counterparts. Notably, our SpikingResformer-L achieves 79.40% top-1 accuracy on ImageNet with 4 time-steps, which is the state-of-the-art result in the SNN field.

Paper Structure (24 sections, 3 theorems, 35 equations, 3 figures, 8 tables)

Introduction
Related Work
Preliminary
Dual Spike Self-Attention
Vanilla Self-Attention
Dual Spike Self-Attention
Scaling Factors in DSSA
Spike-driven Characteristic of DSSA
SpikingResformer
Overall Architecture
Spiking Resformer Block
Experiments
ImageNet Classification
Ablation Study
Transfer Learning
...and 9 more sections

Key Result

Theorem 1

Given spike input ${\bf X}\in\{0,1\}^{T\times p\times m}$, ${\bf Y}\in\{0,1\}^{T\times m\times q}$ and linear transformation $f(\cdot)$ with weight matrix ${\bf W}\in \mathbb{R}^{q\times q}$, ${\bf I}\in \mathbb{R}^{T\times p\times q}$ is the output of DST, ${\bf I}={\rm DST}({\bf X},{\bf Y};f(\cdot

Figures (3)

Figure 1: Comparison of Top-1 accuracy on ImageNet with respect to energy consumption per image for inference (left) and the number of parameters (right). The input size is 224$\times$224.
Figure 2: Left: Architecture of SpikingResformer and components including Dual Spike Self-Attention (DSSA), Multi-Head DSSA (MHDSSA), and Group-Wise Spiking Feed-Forward Network (GWSFFN). Right: Architecture of SEW ResNet-50 and Spikformer.
Figure S1: Diagram of the equivalence of convolution to linear transformation. Top: ${\rm Conv_p}(\cdot)$ on a 4$\times$4 input where $p=2$; Bottom: Its equivalent linear transformation.
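
The equivalence described in the Figure S1 caption can be checked numerically. The following is a minimal sketch, not the authors' implementation. It assumes that ${\rm Conv_p}(\cdot)$ denotes a $p\times p$ convolution with stride $p$ (non-overlapping patches) on a single-channel input. The helper names `conv_p` and `conv_p_as_matrix` are hypothetical and introduced only for illustration.

```python
import numpy as np

def conv_p(x, w):
    """p x p convolution with stride p (assumed) on a single-channel 2-D input."""
    p = w.shape[0]
    h, wd = x.shape
    out = np.empty((h // p, wd // p))
    for i in range(h // p):
        for j in range(wd // p):
            # Each output element sees exactly one non-overlapping p x p patch.
            patch = x[i * p:(i + 1) * p, j * p:(j + 1) * p]
            out[i, j] = np.sum(patch * w)
    return out

def conv_p_as_matrix(w, h, wd):
    """Build the matrix that applies the same operation to the flattened input."""
    p = w.shape[0]
    m = np.zeros(((h // p) * (wd // p), h * wd))
    for i in range(h // p):
        for j in range(wd // p):
            row = i * (wd // p) + j
            for a in range(p):
                for b in range(p):
                    # Column index of pixel (i*p + a, j*p + b) in row-major order.
                    col = (i * p + a) * wd + (j * p + b)
                    m[row, col] = w[a, b]
    return m

rng = np.random.default_rng(0)
x = rng.integers(0, 2, size=(4, 4)).astype(float)  # binary "spike" input, 4 x 4
w = rng.standard_normal((2, 2))                    # kernel with p = 2

direct = conv_p(x, w)
via_matrix = (conv_p_as_matrix(w, 4, 4) @ x.reshape(-1)).reshape(2, 2)
print(np.allclose(direct, via_matrix))
```

Under these assumptions the 4$\times$4 input with $p=2$ yields a 4$\times$16 matrix in which each row holds the kernel weights at the positions of one patch, so the convolution and the matrix product agree.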

Theorems & Definitions (7)

Theorem 1: Mean and variance of DST
Definition 1
Lemma 1
proof
Lemma 2
proof
proof
