Toward Relative Positional Encoding in Spiking Transformers

Changze Lv, Yansen Wang, Dongqi Han, Yifei Shen, Xiaoqing Zheng, Xuanjing Huang, Dongsheng Li

TL;DR

This work tackles the challenge of integrating relative positional information into spiking Transformers without violating binary spike dynamics. It introduces Gray-PE, leveraging a provable Gray-code distance property, and Log-PE, introducing a logarithmic distance bias, along with an extended 2D variant for image patches. The methods are implemented via XNOR-based self-attention and shown to yield consistent gains across time-series forecasting, text classification, and patch-based image classification on multiple SNN backbones, with theoretical and empirical support. The results suggest that relative positional encoding, when adapted to neuromorphic constraints, can significantly enhance sequential modeling capabilities in SNNs and broaden their applicability to real-world tasks.

Abstract

Spiking neural networks (SNNs) are bio-inspired networks that mimic how neurons in the brain communicate through discrete spikes; their energy efficiency and temporal processing capabilities give them great potential in various tasks. SNNs with self-attention mechanisms (spiking Transformers) have recently advanced considerably, and, inspired by traditional Transformers, several studies have demonstrated that spiking absolute positional encoding can help capture sequential relationships in the input data, enhancing the capabilities of spiking Transformers for tasks such as sequential modeling and image classification. However, how to incorporate relative positional information into SNNs remains a challenge. In this paper, we introduce several strategies to approximate relative positional encoding (RPE) in spiking Transformers while preserving the binary nature of spikes. First, we formally prove that encoding relative distances with Gray code ensures that the binary representations of positional indices maintain a constant Hamming distance whenever their decimal values differ by a power of two, and we propose Gray-PE based on this property. In addition, we propose another RPE method, Log-PE, which incorporates the logarithmic form of the relative distance matrix directly into the spiking attention map. Furthermore, we extend our RPE methods to a two-dimensional form, making them suitable for processing image patches. We evaluate our RPE methods on various tasks, including time-series forecasting, text classification, and patch-based image classification, and the experimental results demonstrate satisfactory performance gains from incorporating our RPE methods across many architectures. Our results provide fresh perspectives on designing spiking Transformers to advance their sequential modeling capability, thereby expanding their applicability across various domains.

Paper Structure

This paper contains 45 sections, 1 theorem, 24 equations, 3 figures, 8 tables.

Key Result

Theorem 1

(Proof in the Appendix.) For two position indices differing by $2^n$ ($n \geq 0$), their Gray code representations have a consistent Hamming distance. Specifically, $\forall$ position $i$, we have:
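The property can be checked numerically. The following is a minimal sketch, assuming the standard reflected Gray code $g(i) = i \oplus (i \gg 1)$ and an unbounded bit width; the function names `gray`, `hamming`, and `distances_for_offset` are illustrative and do not come from the paper.

```python
def gray(i: int) -> int:
    """Reflected binary Gray code of a non-negative integer (assumed standard form)."""
    return i ^ (i >> 1)


def hamming(a: int, b: int) -> int:
    """Number of bit positions in which a and b differ."""
    return bin(a ^ b).count("1")


def distances_for_offset(n: int, max_index: int = 1024) -> set:
    """Set of Hamming distances between gray(i) and gray(i + 2**n) for i < max_index."""
    return {hamming(gray(i), gray(i + 2 ** n)) for i in range(max_index)}


if __name__ == "__main__":
    for n in range(6):
        # A single-element set means the distance is the same for every i tested.
        print(f"offset 2^{n}: distances = {distances_for_offset(n)}")
```

Under these assumptions, each offset yields a single-element set: distance 1 for an offset of $2^0$ and distance 2 for every larger power of two. Offsets that are not powers of two (for example 3) give more than one distance, which is why the property is stated only for powers of two.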

Figures (3)

  • Figure 1: Illustration of preliminary knowledge. (a) Spike dynamics of LIF neurons. (b) Illustration of vanilla spiking self-attention in Spikformer (Zhou et al., 2022). (c) An example of Hamming distance between two spike trains. (d) The calculation process of the classic reflected Gray code.
  • Figure 2: Overview of Our Method. (a) XNOR-based spiking self-attention. We illustrate the computation flow for $\mathbf{Q}$ and $\mathbf{K}$ in a PyTorch-style notation. (b) Gray-PE. Position indices differing by $2^n$ exhibit a consistent Hamming distance on their Gray code representations. Gray-PE is implemented by concatenating $G(\boldsymbol{l})$ along the $D$ dimension on both $\mathbf{Q}$ and $\mathbf{K}$. (c) Log-PE. A pre-assigned relative distance encoding map $\mathbf{R}_{i,j} \in \mathbb{N}_0$ is added to the original attention map $\mathbf{AttnMap}$. (d) 2D Form of Gray-PE. A 2D RPE is more suitable than the 1D version for image patches, as it captures the spatial relationships more effectively.
  • Figure 3: Spikformer-XNOR with Gray-PE across various bit numbers ranging from $2$ to $12$ on (a) time-series forecasting tasks and (b) text classification tasks.
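The Gray-PE idea described for Figure 2(a) and 2(b) can be sketched in a few lines. This is an illustration under stated assumptions, not the authors' implementation: the XNOR score is taken to be the plain count of matching bits between a binary query and key, no scaling or spiking neuron layer is applied, and the helper names `gray_bits`, `xnor_scores`, and `with_gray_pe` are hypothetical.

```python
import numpy as np


def gray_bits(positions: np.ndarray, num_bits: int) -> np.ndarray:
    """Binary Gray code of each position, shape (L, num_bits), least significant bit first."""
    g = positions ^ (positions >> 1)
    shifts = np.arange(num_bits)
    return ((g[:, None] >> shifts) & 1).astype(np.uint8)


def xnor_scores(q: np.ndarray, k: np.ndarray) -> np.ndarray:
    """Assumed XNOR similarity: number of matching bits for every (query, key) pair."""
    return (q[:, None, :] == k[None, :, :]).sum(axis=-1)


def with_gray_pe(q: np.ndarray, k: np.ndarray, num_bits: int):
    """Concatenate Gray code bits of the position index along the feature dimension."""
    pe = gray_bits(np.arange(q.shape[0]), num_bits)
    return np.concatenate([q, pe], axis=-1), np.concatenate([k, pe], axis=-1)


if __name__ == "__main__":
    L, D, B = 8, 4, 3  # sequence length, feature size, Gray code bits (all illustrative)
    q = np.zeros((L, D), dtype=np.uint8)  # constant spikes isolate the positional term
    k = np.zeros((L, D), dtype=np.uint8)
    q_pe, k_pe = with_gray_pe(q, k, B)
    print(xnor_scores(q_pe, k_pe))
```

With this scoring rule, the concatenated bits add `B` minus the Hamming distance between the two Gray codes to each score, so key positions at any power-of-two offset of at least 2 from a query receive the same positional contribution.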

Theorems & Definitions (3)

  • Theorem 1
  • Definition 1
  • proof