QKFormer: Hierarchical Spiking Transformer using Q-K Attention

Chenlin Zhou, Han Zhang, Zhaokun Zhou, Liutao Yu, Liwei Huang, Xiaopeng Fan, Li Yuan, Zhengyu Ma, Huihui Zhou, Yonghong Tian

TL;DR

QKFormer, a hierarchical spiking transformer based on Q-K attention, outperforms existing state-of-the-art SNN models on various mainstream datasets. It achieves a top-1 accuracy of 85.65% on ImageNet-1k, outperforming Spikformer by 10.84%.

Abstract

Spiking Transformers, which integrate Spiking Neural Networks (SNNs) with Transformer architectures, have attracted significant attention due to their potential for energy efficiency and high performance. However, existing models in this domain still suffer from suboptimal performance. We introduce several innovations to improve the performance: i) We propose a novel spike-form Q-K attention mechanism, tailored for SNNs, which efficiently models the importance of token or channel dimensions through binary vectors with linear complexity. ii) We incorporate a hierarchical structure, which significantly benefits the performance of both the brain and artificial neural networks, into spiking transformers to obtain multi-scale spiking representations. iii) We design a versatile and powerful patch embedding module with a deformed shortcut specifically for spiking transformers. Together, these yield QKFormer, a directly trained hierarchical spiking transformer based on Q-K attention. QKFormer shows significantly superior performance over existing state-of-the-art SNN models on various mainstream datasets. Notably, with a size comparable to Spikformer (66.34 M, 74.81%), QKFormer (64.96 M) achieves a groundbreaking top-1 accuracy of 85.65% on ImageNet-1k, substantially outperforming Spikformer by 10.84%. To the best of our knowledge, this is the first time that directly trained SNNs have exceeded 85% accuracy on ImageNet-1k. The code and models are publicly available at https://github.com/zhouchenlin2096/QKFormer
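
To make the "binary vectors with linear complexity" idea concrete, here is a minimal sketch of a token-wise Q-K attention step. It assumes the token mask is obtained by summing each token's Q spikes over channels and thresholding the sum, with a plain threshold standing in for the spiking neuron. The names `qk_token_attention` and `heaviside_spike`, the threshold value, and the toy inputs are illustrative assumptions and are not taken from the released code.

```python
from typing import List

Spikes = List[List[int]]  # N tokens x D channels, entries in {0, 1}


def heaviside_spike(value: float, threshold: float) -> int:
    """Stand-in for a spiking neuron: emit 1 when the input reaches the threshold."""
    return 1 if value >= threshold else 0


def qk_token_attention(q: Spikes, k: Spikes, threshold: float = 1.0) -> Spikes:
    """Mask the rows (tokens) of K with a binary vector derived from Q."""
    if len(q) != len(k):
        raise ValueError("Q and K must have the same number of tokens")
    # Addition only: one pass over Q costs O(N * D); no N x N map is formed.
    token_mask = [heaviside_spike(sum(row), threshold) for row in q]
    # Mask operation: keep a token's K row when its mask bit is 1, else zero it.
    return [[bit * m for bit in row] for row, m in zip(k, token_mask)]


# Toy example (assumed values): 3 tokens, 3 channels.
q = [[1, 0, 1], [0, 0, 0], [0, 1, 0]]
k = [[1, 1, 0], [1, 0, 1], [0, 0, 1]]
out = qk_token_attention(q, k, threshold=2.0)
print(out)
```

Because the attention signal is a length-N binary vector rather than an N x N matrix, the work in this sketch grows linearly with the number of tokens. A channel-wise variant would, under the same assumptions, sum over tokens instead and mask the columns of K.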

Paper Structure

This paper contains 24 sections, 21 equations, 6 figures, and 10 tables.

Figures (6)

  • Figure 1: Illustration of Q-K attention in its two versions, Q-K token attention (QKTA) and Q-K channel attention (QKCA). The inputs are binary spikes, and Q-K attention involves only sparse additions and mask operations. As a spike-driven module, Q-K attention efficiently models token or channel attention through spike-form binary vectors, with complexity linear in the number of tokens (or channels) and high energy efficiency. The Spiking Neuron (SN) in this work adopts the Leaky Integrate-and-Fire (LIF) model, which is described in the Appendix.
  • Figure 2: The overview of QKFormer, a hierarchical spiking transformer with Q-K attention.
  • Figure 3: The visualization and memory consumption of QKTA. (a) Visualization of Q-K token attention. A white dot denotes value 1, while a black dot denotes value 0. (b) Comparison of memory costs between QKTA and SSA under different token numbers. $N$ is the token number.
  • Figure 4: (a) The variance and expectation of SSA. (b) The variance and expectation of QKTA. All spike elements (either 0 or 1) in SSA and QKTA are assumed to be independent random variables following Bernoulli distributions.
  • Figure 5: (a) Spiking Patch Splitting (SPS) module in Spikformer. (b) Spiking Patch Embedding with Deformed Shortcut (SPEDS) module in QKFormer.
  • ...and 1 more figures
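
The Figure 1 caption states that the spiking neuron follows the Leaky Integrate-and-Fire (LIF) model. The sketch below shows a generic discrete-time LIF neuron with a hard reset, to illustrate how real-valued inputs become binary spikes. It is a textbook-style formulation under assumed parameters (`decay`, `v_threshold`, `v_reset`) and may differ in detail from the formulation in the paper's appendix.

```python
from typing import List


def lif_neuron(inputs: List[float], v_threshold: float = 1.0,
               v_reset: float = 0.0, decay: float = 0.5) -> List[int]:
    """Simulate one discrete-time LIF neuron and return its binary spike train."""
    membrane = v_reset
    spikes: List[int] = []
    for current in inputs:
        # Leak toward the reset potential, then integrate the new input.
        membrane = v_reset + decay * (membrane - v_reset) + current
        # Fire when the membrane potential reaches the threshold.
        spike = 1 if membrane >= v_threshold else 0
        spikes.append(spike)
        # Hard reset after a spike; otherwise the potential carries over.
        if spike:
            membrane = v_reset
    return spikes


# Assumed toy input: three sub-threshold steps, a silent step, one strong step.
spike_train = lif_neuron([0.6, 0.6, 0.6, 0.0, 1.2])
print(spike_train)
```

The neuron stays silent while its leaky potential accumulates, fires once the threshold is crossed, and then starts again from the reset value, so its output is always a 0/1 sequence suitable as input to spike-form attention.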