Spiking Transformer with Spatial-Temporal Attention

Donghyun Lee; Yuhang Li; Youngeun Kim; Shiting Xiao; Priyadarshini Panda

TL;DR

This work introduces STAtten, a block-wise spatial-temporal attention mechanism for spike-based transformers that preserves the original computational complexity $O(TND^2)$ while incorporating temporal dependencies. Through entropy-based analysis, STAtten demonstrates more structured feature representations than spatial-only attention and is shown to improve accuracy across static and neuromorphic vision datasets when plugged into existing spike-based backbones. The approach maintains energy efficiency and memory advantages of spike-based computation and achieves state-of-the-art or competitive results on CIFAR/ImageNet and neuromorphic datasets such as CIFAR10-DVS and N-Caltech101. While hardware deployment on traditional neuromorphic chips remains challenging, the proposed block-wise strategy and compatibility with multiple backbones position STAtten as a practical enhancement for energy-efficient, temporally-aware neuromorphic vision models.

Abstract

Spike-based Transformers present a compelling and energy-efficient alternative to traditional Artificial Neural Network (ANN)-based Transformers, achieving impressive results through sparse binary computations. However, existing spike-based transformers predominantly focus on spatial attention while neglecting the temporal dependencies inherent in spike-based processing, leading to suboptimal feature representation and limited performance. To address this limitation, we propose Spiking Transformer with Spatial-Temporal Attention (STAtten), a simple architecture that efficiently integrates both spatial and temporal information in the self-attention mechanism. STAtten introduces a block-wise computation strategy that processes information in spatial-temporal chunks, capturing features across both space and time while maintaining the same computational complexity as previous spatial-only approaches. Our method can be integrated into existing spike-based transformers without architectural overhaul. Extensive experiments demonstrate that STAtten improves the performance of existing spike-based transformers across both static and neuromorphic datasets, including CIFAR10/100, ImageNet, CIFAR10-DVS, and N-Caltech101. The code is available at https://github.com/Intelligent-Computing-Lab-Yale/STAtten

Paper Structure (24 sections, 18 equations, 5 figures, 12 tables)

Introduction
Related Works
Preliminary
Methodology
Motivation of Spatial-Temporal Attention
STAtten Mechanism
STAtten with Existing Spiking Transformers
Complexity and Energy of Self-attention
Experiments
Sequential CIFAR10/100 Classification
Performance Analysis
Memory and Energy Analysis
Model Capacity
Limitation
Conclusion
...and 9 more sections

Figures (5)

Figure 1: Heatmaps of spatial-only attention versus our STAtten on the sequential CIFAR100 dataset. Input images are divided column-wise, where each column corresponds to one timestep.
Figure 2: Comparison of self-attention variants on the CIFAR100 dataset. Analysis of entropy and accuracy for Temporal-only (T), Spatial-only (S), and Spatial-temporal (ST) attention.
Figure 3: (a) Maximum batch size of block-wise STAtten and full spatial-temporal attention (without block partitioning) when running on an A5000 GPU with 24 GB of VRAM. (b) Average number of active neurons after QKV computation at different timestep combinations, where [t] indicates the timestep index.
Figure 4: Overview of the STAtten architecture. (a) Block-wise temporal attention mechanism. Binary Q, K, V tensors are partitioned into temporal blocks, where black lines indicate paired timestep processing. (b) Computation flow with tensor dimensions, where $T$ is the number of timesteps, $N$ is the number of tokens, $B$ is the block size, and $D$ is the feature dimension.
Figure 5: Accuracy comparison with respect to the number of parameters on (a) CIFAR100 and (b) Sequential CIFAR100 datasets. Spike-driven Transformer [yao2024spike] is used as the spatial-only baseline architecture.
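
The block-wise computation summarized in the Figure 4 caption can be illustrated with a minimal NumPy sketch. It is not the authors' implementation (see the linked repository for that). It assumes a softmax-free attention, so that $K^\top V$ can be computed before multiplying by $Q$, and it assumes that each block simply merges $B$ consecutive timesteps with the $N$ tokens. The function names, the `scale` parameter, and the omission of the spiking neuron layers are all simplifications made for illustration.

```python
import numpy as np


def blockwise_st_attention(q, k, v, block_size, scale=1.0):
    """Hypothetical sketch of block-wise spatial-temporal attention.

    q, k, v: binary arrays of shape (T, N, D).
    Returns an array of shape (T, N, D).
    """
    T, N, D = q.shape
    if T % block_size != 0:
        raise ValueError("T must be divisible by block_size")
    n_blocks = T // block_size
    # Merge the B timesteps of each block with the N tokens: (T/B, B*N, D).
    qb = q.reshape(n_blocks, block_size * N, D)
    kb = k.reshape(n_blocks, block_size * N, D)
    vb = v.reshape(n_blocks, block_size * N, D)
    # Assumption: no softmax, so K^T V can be formed first, giving (T/B, D, D).
    # This costs about B*N*D^2 multiply-adds per block, T*N*D^2 in total.
    kv = np.einsum("bnd,bne->bde", kb, vb)
    # Multiply by Q: again about T*N*D^2 multiply-adds in total.
    out = np.einsum("bnd,bde->bne", qb, kv) * scale
    return out.reshape(T, N, D)


def reference_st_attention(q, k, v, block_size, scale=1.0):
    """Same result via the explicit (B*N) x (B*N) attention map per block."""
    T, N, D = q.shape
    out = np.empty((T, N, D), dtype=np.float64)
    for start in range(0, T, block_size):
        stop = start + block_size
        qb = q[start:stop].reshape(-1, D)
        kb = k[start:stop].reshape(-1, D)
        vb = v[start:stop].reshape(-1, D)
        attn = qb @ kb.T  # every token attends across all B timesteps
        out[start:stop] = ((attn @ vb) * scale).reshape(block_size, N, D)
    return out


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    T, N, D, B = 4, 8, 16, 2  # illustrative sizes only
    q, k, v = (rng.integers(0, 2, size=(T, N, D)).astype(np.float64)
               for _ in range(3))
    fast = blockwise_st_attention(q, k, v, block_size=B)
    slow = reference_st_attention(q, k, v, block_size=B)
    print(fast.shape, np.allclose(fast, slow))
```

Under these assumptions both functions return the same values, but the first never materializes the $(BN) \times (BN)$ attention map, and its cost is independent of $B$, which is consistent with the $O(TND^2)$ complexity stated in the TL;DR.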