Fourier or Wavelet bases as counterpart self-attention in spikformer for efficient visual classification
Qingyu Wang, Duzhen Zhang, Tielin Zhang, Bo Xu
TL;DR
The paper addresses the high cost of self-attention in spiking Transformer models by replacing vanilla Spiking Self-Attention with spike-form Fourier or Wavelet transforms. It introduces FWformer, which uses fixed or learned combinations of Fourier/Wavelet bases to transform spike sequences, reducing time complexity from $O(N^2)$ to $O(N\log N)$ (and analogous gains in 2D) while maintaining or improving accuracy. Empirical results on event-based video datasets (CIFAR10-DVS, DvsGesture) and static image datasets (CIFAR10/100) show comparable or better accuracy, along with substantial gains in training/inference speed and energy efficiency, and reduced memory usage. The work also analyzes the orthogonality of bases during training and demonstrates that fixed non-orthogonal combinations can yield further accuracy improvements, highlighting a promising direction for efficient, theory-informed Transformer design in spike-based computation.
Abstract
The energy-efficient spikformer was proposed by integrating the biologically plausible spiking neural network (SNN) with the artificial Transformer, using Spiking Self-Attention (SSA) to achieve both higher accuracy and lower computational cost. However, self-attention does not always seem necessary, especially in sparse spike-form computation. In this paper, we replace vanilla SSA (which uses dynamic bases calculated from Query and Key) with the spike-form Fourier Transform, Wavelet Transform, and their combinations (which use fixed trigonometric or wavelet bases), based on the key hypothesis that both approaches use a set of basis functions for information transformation. Hence, the Fourier-or-Wavelet-based spikformer (FWformer) is proposed and verified on visual classification tasks, including both static image and event-based video datasets. Compared to the standard spikformer, the FWformer achieves comparable or even higher accuracy (by $0.4\%$-$1.5\%$), higher running speed ($9\%$-$51\%$ for training and $19\%$-$70\%$ for inference), reduced theoretical energy consumption ($20\%$-$25\%$), and reduced GPU memory usage ($4\%$-$26\%$). Our results indicate that the continued refinement of new Transformers, inspired either by biological discovery (spike-form) or by information theory (Fourier or Wavelet Transform), is promising.
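The contrast between input-dependent attention and a fixed basis can be illustrated with a minimal sketch. It is not the FWformer implementation: spiking neurons, learned projections, 2D transforms, and wavelet bases are all omitted, and the function names and the example spike sequence are hypothetical. It only shows that mixing a binary sequence with a fixed Fourier basis depends on the sequence length alone, and that the same result can be obtained either by an $O(N^2)$ matrix-vector product or by an $O(N\log N)$ fast Fourier transform.

```python
import cmath


def dft_basis(n):
    """Fixed Fourier basis: an n x n matrix that depends only on n, not on the input."""
    return [[cmath.exp(-2j * cmath.pi * k * t / n) for t in range(n)]
            for k in range(n)]


def mix_with_basis(basis, x):
    """Token mixing as a plain matrix-vector product: O(N^2) operations."""
    return [sum(b * v for b, v in zip(row, x)) for row in basis]


def fft(x):
    """Radix-2 Cooley-Tukey FFT: the same transform in O(N log N).

    The input length must be a power of two.
    """
    n = len(x)
    if n == 1:
        return [complex(x[0])]
    even = fft(x[0::2])
    odd = fft(x[1::2])
    out = [0j] * n
    for k in range(n // 2):
        twiddled = cmath.exp(-2j * cmath.pi * k / n) * odd[k]
        out[k] = even[k] + twiddled
        out[k + n // 2] = even[k] - twiddled
    return out


# Hypothetical binary spike sequence with N = 8 tokens.
spikes = [1, 0, 0, 1, 1, 0, 1, 0]

slow = mix_with_basis(dft_basis(len(spikes)), spikes)  # explicit fixed basis
fast = fft(spikes)                                     # fast algorithm

# Largest disagreement between the two routes (floating-point noise only).
max_err = max(abs(a - b) for a, b in zip(slow, fast))
```

Because the basis never depends on the input, no Query-Key product has to be formed per sample, which is the property the abstract's hypothesis relies on; the fast algorithm is what turns that fixed mixing into the sub-quadratic cost.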
