Fourier or Wavelet bases as counterpart self-attention in spikformer for efficient visual classification

Qingyu Wang, Duzhen Zhang, Tielin Zhang, Bo Xu

TL;DR

The paper addresses the high cost of self-attention in spiking Transformer models by replacing vanilla Spiking Self-Attention with spike-form Fourier or Wavelet transforms. It introduces FWformer, which uses fixed or learned combinations of Fourier/Wavelet bases to transform spike sequences, reducing time complexity from $O(N^2)$ to $O(N\log N)$ (and analogous gains in 2D) while maintaining or improving accuracy. Empirical results on event-based video datasets (CIFAR10-DVS, DvsGesture) and static image datasets (CIFAR10/100) show comparable or better accuracy, along with substantial gains in training/inference speed and energy efficiency, and reduced memory usage. The work also analyzes the orthogonality of bases during training and demonstrates that fixed non-orthogonal combinations can yield further accuracy improvements, highlighting a promising direction for efficient, theory-informed Transformer design in spike-based computation.

Abstract

The energy-efficient spikformer was proposed by integrating the biologically plausible spiking neural network (SNN) with the artificial Transformer, using Spiking Self-Attention (SSA) to achieve both higher accuracy and lower computational cost. However, self-attention does not appear to be always necessary, especially in sparse spike-form computation. In this paper, we replace vanilla SSA (which uses dynamic bases calculated from Query and Key) with the spike-form Fourier Transform, Wavelet Transform, and their combinations (which use fixed triangular or wavelet bases), based on the key hypothesis that both approaches use a set of basis functions for information transformation. Hence, the Fourier-or-Wavelet-based spikformer (FWformer) is proposed and verified on visual classification tasks, including both static image and event-based video datasets. Compared to the standard spikformer, the FWformer achieves comparable or even higher accuracy ($0.4\%$-$1.5\%$), higher running speed ($9\%$-$51\%$ for training and $19\%$-$70\%$ for inference), reduced theoretical energy consumption ($20\%$-$25\%$), and reduced GPU memory usage ($4\%$-$26\%$). Our results indicate that the continued refinement of new Transformers, inspired either by biological discovery (spike-form) or by information theory (Fourier or Wavelet Transform), is promising.
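The complexity argument behind replacing SSA with a fixed Fourier basis can be illustrated with a small NumPy sketch. This is not the authors' implementation: the function names `dft_matrix`, `mix_dense`, and `mix_fft`, the toy sizes, and the choice to keep only the real part of the transform are all assumptions made for illustration. The sketch only shows that mixing $N$ tokens with an explicit fixed basis matrix (an $O(N^2)$ product) gives the same result as the fast transform (an $O(N\log N)$ computation).

```python
import numpy as np

def dft_matrix(n):
    """Dense n x n Fourier basis: W[j, k] = exp(-2*pi*i*j*k / n)."""
    idx = np.arange(n)
    return np.exp(-2j * np.pi * np.outer(idx, idx) / n)

def mix_dense(spikes):
    """Token mixing with an explicit basis matrix: O(N^2) work per channel."""
    n = spikes.shape[0]
    # Keeping the real part is an assumption of this sketch, not a claim about the paper.
    return (dft_matrix(n) @ spikes).real

def mix_fft(spikes):
    """The same mixing computed with the FFT: O(N log N) work per channel."""
    return np.fft.fft(spikes, axis=0).real

# Hypothetical toy input: N = 64 tokens, D = 16 channels, binary spikes.
rng = np.random.default_rng(0)
spikes = (rng.random((64, 16)) < 0.2).astype(np.float64)

dense_out = mix_dense(spikes)
fft_out = mix_fft(spikes)
max_gap = np.abs(dense_out - fft_out).max()  # numerical difference between the two paths
```

Because the basis is fixed, no Query or Key projections are needed to build it, which is where the savings over SSA's dynamically computed bases would come from in a sketch like this.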

Paper Structure (22 sections, 14 equations, 2 figures, 5 tables)

Figures (2)

  • Figure 1: The overall architecture of our proposed FWformer. It mainly consists of three components: (1) Spiking Patch Splitting (SPS) module, (2) FWformer Encoder Layer, and (3) Classification Layer. Additionally, we highlight the similarities between the FW head and SSA head at a single time step, which inspires us to choose the former as an exploration for more efficient calculations within the spike-form framework.
  • Figure 2: (A) We treat spiking self-attention as a set of basis functions and proceed to measure the changes in their orthogonality throughout the training process. (B) A diagram visualizing how the basis functions, spanning a feature space, are transformed from orthogonal to non-orthogonal, with only two axes used for simplification. (C) A diagram visualizing our endeavor to employ fixed non-orthogonal bases.
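
One way to quantify the orthogonality of a set of basis functions, as discussed for Figure 2, is sketched below. The metric used here (mean absolute off-diagonal entry of the cosine Gram matrix) and the name `orthogonality_gap` are assumptions chosen for illustration; the source does not state which measure the authors use. A value of 0 means the bases are mutually orthogonal, and larger values mean they overlap more.

```python
import numpy as np

def orthogonality_gap(bases):
    """Mean absolute off-diagonal entry of the cosine Gram matrix.

    `bases` holds one basis vector per row. Returns 0 for orthogonal rows.
    This metric is an illustrative assumption, not the paper's definition.
    """
    norms = np.linalg.norm(bases, axis=1, keepdims=True)
    unit = bases / np.maximum(norms, 1e-12)   # normalise each row to unit length
    gram = unit @ unit.conj().T               # pairwise inner products
    n = gram.shape[0]
    off_diag = gram[~np.eye(n, dtype=bool)]   # drop the self-similarity terms
    return float(np.abs(off_diag).mean())

n = 8
idx = np.arange(n)
fourier = np.exp(-2j * np.pi * np.outer(idx, idx) / n)  # fixed orthogonal Fourier bases

# Hypothetical non-orthogonal bases: the Fourier bases plus random noise.
rng = np.random.default_rng(0)
perturbed = fourier + 0.3 * rng.standard_normal((n, n))

gap_fourier = orthogonality_gap(fourier)
gap_perturbed = orthogonality_gap(perturbed)
```

Tracking a score like this over training steps would show whether a learned set of bases drifts away from orthogonality, which is the kind of change panel (A) describes.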