Learning Long Sequences in Spiking Neural Networks

Matei Ioan Stan; Oliver Rhodes

TL;DR

This work investigates bringing state space models (SSMs) to spiking neural networks (SNNs) for long-range sequence modelling, addressing both efficiency and accuracy on neuromorphic hardware. By introducing a Binary SSM (Binary S4D) and the Gated Spiking Unit (GSU), the authors enable efficient, addition-based feature mixing with non-differentiable spikes while mitigating vanishing gradients. Empirically, SSM-based SNNs outperform Transformers on the Long Range Arena and achieve state-of-the-art SNN performance on sequential MNIST. A gap to non-binarised baselines remains, but it is largely closed by the GSU and non-saturating activations. The findings suggest that saturating spiking activations are a key limitation for scaling SNNs to long sequences, and that non-binary, non-saturating forward operations can preserve energy efficiency while maintaining strong performance, paving the way for neuromorphic deployment of large-scale SSMs.

Abstract

Spiking neural networks (SNNs) take inspiration from the brain to enable energy-efficient computations. Since the advent of Transformers, SNNs have struggled to compete with artificial networks on modern sequential tasks, as they inherit limitations from recurrent neural networks (RNNs), with the added challenge of training with non-differentiable binary spiking activations. However, a recent renewed interest in efficient alternatives to Transformers has given rise to state-of-the-art recurrent architectures named state space models (SSMs). This work systematically investigates, for the first time, the intersection of state-of-the-art SSMs with SNNs for long-range sequence modelling. Results suggest that SSM-based SNNs can outperform the Transformer on all tasks of a well-established long-range sequence modelling benchmark. It is also shown that SSM-based SNNs can outperform current state-of-the-art SNNs with fewer parameters on sequential image classification. Finally, a novel feature mixing layer is introduced, improving SNN accuracy while challenging assumptions about the role of binary activations in SNNs. This work paves the way for deploying powerful SSM-based architectures, such as large language models, to neuromorphic hardware for energy-efficient long-range sequence modelling.

Paper Structure (18 sections, 20 equations, 6 figures, 4 tables)

Introduction
Results
LRA Accuracy
Sequential MNIST Accuracy
Effect of Surrogate Gradient Function
Baseline Saturating Activations
Discussion
Methods
Leaky Integrate-and-Fire Neurons
State Space Models
State Space Initialisation
Binary S4D
Surrogate Gradients
Gated Spiking Unit
LRA Experimental Setup
...and 3 more sections

Figures (6)

Figure 1: Example Computational Graphs for Sequence Models. Subfigure \ref{fig:unroll_rnn} shows how basic RNNs perform computations over time. Of note is the inclusion of nonlinearities between time steps, which entail iterative computations. In addition, one can observe how, during the backward pass using BPTT, credit assignment between time steps $\frac{\partial h_{p}}{\partial h_{q}}$, where $q \ll p$, involves numerous repeated multiplications, which can cause vanishing or exploding gradients. Subfigure \ref{fig:unroll_snn} highlights the structural similarities between SNNs and RNNs. One important difference stems from the addition of a linear recurrence based on leaky membrane voltages in neurons such as Leaky Integrate-and-Fire neurons in SNNs. Moreover, the defining feature of SNNs is that neuron outputs consist of sparse binary spike trains. Subfigure \ref{fig:self_attention} underlines the parallel nature of Transformers, where input history is no longer compressed within an evolving network state. The attention matrix containing all pair-wise similarities between tokens in the input sequence is multiplied with the $V$ projection of the inputs in a large, dense matrix-matrix multiplication, which is unfavourable for neuromorphic hardware implementation. Subfigure \ref{fig:unrolled_ssm} illustrates the dual interpretation of recurrences in linear time-invariant SSMs. In architectures such as S4 \cite{gu2021efficiently}, individual SSM units are single-input single-output (SISO). The scalar input ($i_t$) is projected onto a high-dimensional space using $B \in \mathbb{R}^{d}$ at each time step. The state of the model ($u_t$) evolves over time using the transition matrix $A \in \mathbb{R}^{d \times d}$. SSM-based neural networks use the initialisation of $A$ and $B$ to implicitly encode projections of input signals onto an orthogonal polynomial basis. To produce a scalar output ($y_t$), the state vector is linearly projected back onto a single dimension using a vector $C \in \mathbb{R}^{d}$.
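The dual interpretation mentioned in the Figure 1 caption can be illustrated with a minimal sketch. It assumes the discrete-time form $u_t = A u_{t-1} + B i_t$, $y_t = C \cdot u_t$ with a zero initial state, which the caption implies but does not state explicitly. The function names, the random real-valued diagonal $A$, and the sizes are illustrative choices, not the authors' implementation or initialisation.

```python
import numpy as np

def ssm_recurrent(A, B, C, inputs):
    """Step a SISO SSM through time: u_t = A u_{t-1} + B i_t, y_t = C . u_t."""
    u = np.zeros(A.shape[0])          # assumed zero initial state
    outputs = []
    for i_t in inputs:
        u = A @ u + B * i_t           # linear state update, no nonlinearity
        outputs.append(C @ u)         # project the state back to a scalar
    return np.array(outputs)

def ssm_convolutional(A, B, C, inputs):
    """Same outputs from the kernel K_k = C A^k B used as a causal convolution."""
    length = len(inputs)
    kernel = np.empty(length)
    a_pow_b = B.copy()                # holds A^k B, starting at k = 0
    for k in range(length):
        kernel[k] = C @ a_pow_b
        a_pow_b = A @ a_pow_b
    # Full convolution truncated to the first `length` outputs is causal.
    return np.convolve(inputs, kernel)[:length]

rng = np.random.default_rng(0)
d, seq_len = 4, 16                    # hypothetical state size and sequence length
A = np.diag(rng.uniform(0.5, 0.95, size=d))  # stable diagonal A (illustrative only)
B = rng.normal(size=d)
C = rng.normal(size=d)
inputs = rng.normal(size=seq_len)

y_rec = ssm_recurrent(A, B, C, inputs)
y_conv = ssm_convolutional(A, B, C, inputs)
```

Because the recurrence is linear and time-invariant, both views produce the same outputs: the recurrent form suits step-by-step inference, while the convolutional form allows the whole sequence to be processed in parallel.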
Figure 2: Binary SSM Layer. At each time step, a Binary SSM layer consists of independent single-input single-output (SISO) SSM "neurons". Binary activations are applied element-wise to each SSM output before position-wise feature mixing to avoid dense vector-matrix multiplication.
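A rough forward-pass sketch of one time step of such a layer follows. It assumes real-valued diagonal transition parameters, a Heaviside threshold at zero, and a mixing matrix `W_mix`; all names, shapes, and parameter values are hypothetical, and training (for example with surrogate gradients) is not shown. The point is only that, once activations are binary, feature mixing reduces to adding the weight columns of the features that spiked.

```python
import numpy as np

def binary_ssm_step(state, a_diag, B, C, x_t, W_mix, threshold=0.0):
    """One time step of H independent SISO SSMs followed by binary mixing.

    state, a_diag, B, C: arrays of shape (H, d); x_t: shape (H,);
    W_mix: shape (H_out, H). Shapes and names are illustrative assumptions.
    """
    # Each row is its own SISO SSM with a diagonal transition (assumption).
    state = a_diag * state + B * x_t[:, None]
    y = np.sum(C * state, axis=1)            # one scalar output per SSM
    spikes = y > threshold                   # element-wise binary activation
    # With binary inputs, W_mix @ spikes is a sum of selected columns:
    # only additions are needed, no multiplications.
    mixed = W_mix[:, spikes].sum(axis=1)
    return state, spikes.astype(np.float64), mixed

rng = np.random.default_rng(0)
H, d, H_out = 8, 4, 5                        # hypothetical layer sizes
a_diag = rng.uniform(0.5, 0.95, size=(H, d))
B = rng.normal(size=(H, d))
C = rng.normal(size=(H, d))
W_mix = rng.normal(size=(H_out, H))
state = np.zeros((H, d))
x_t = rng.normal(size=H)

state, spikes, mixed = binary_ssm_step(state, a_diag, B, C, x_t, W_mix)
```

The column-sum result equals the dense product `W_mix @ spikes`, which is why applying the binary activation before mixing avoids a dense vector-matrix multiplication.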
Figure 3: Input Scales. Subfigure \ref{fig:original_samples} shows relative sizes of samples from (left to right) MNIST, CIFAR10 and Path-X, with respective resolutions of 28x28 (784), 32x32 (1024), and 128x128 (16384). Subfigure \ref{fig:smnist_bellec} shows the flattening process used in all image-based sequential tasks (adapted from \cite{bellec2018long}). Subfigure \ref{fig:flattened_samples} visualises how the lengths of the flattened image samples compare. One can easily observe from \ref{fig:original_samples} and \ref{fig:flattened_samples} that Path-X contains input sequences more than twenty times longer than those of sequential MNIST, which is commonly used for probing long-range dependencies in SNNs.
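The sequence lengths quoted in the Figure 3 caption follow directly from flattening each image into a one-dimensional pixel sequence. The sketch below assumes single-channel images and row-major flattening; the helper name `flatten_image` is hypothetical, and the zero-filled arrays are placeholders rather than real dataset samples.

```python
import numpy as np

def flatten_image(image):
    """Flatten a 2-D image into a 1-D pixel sequence (row-major order assumed)."""
    return image.reshape(-1)

# Resolutions taken from the caption; placeholder arrays stand in for samples.
resolutions = {"MNIST": (28, 28), "CIFAR10": (32, 32), "Path-X": (128, 128)}
lengths = {
    name: flatten_image(np.zeros(shape)).shape[0]
    for name, shape in resolutions.items()
}

# Ratio of Path-X to sequential MNIST sequence length: 16384 / 784
ratio = lengths["Path-X"] / lengths["MNIST"]
```

Here `lengths` evaluates to 784, 1024, and 16384 steps respectively, and `ratio` is roughly 20.9, consistent with the "more than twenty times longer" statement.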
Figure 4: Accuracy on the LRA benchmark. Binary S4D performs on average more than 10% worse than the baseline but still over 20% better than the Transformer. The GSU achieves 1.06% lower accuracy than the baseline on average, and at most just 2.83% below the baseline on Image. On Path-X, Binary S4D has 30% lower accuracy than the baseline yet still manages to outperform the Transformer by 11.2%.
Figure 5: Convergence on Path-X. Applying saturating activation functions to SSM outputs leads to reduced accuracy on Path-X, similar to binary spiking activations.
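To make the term "saturating" concrete, the sketch below compares the derivative of a saturating function with that of a non-saturating one at increasingly large inputs. Here tanh and ReLU are chosen only as familiar examples; the source does not say which activations were used for Figure 5, and this sketch does not reproduce the experiment.

```python
import numpy as np

def tanh_grad(x):
    """Derivative of tanh: shrinks towards zero as |x| grows (saturation)."""
    return 1.0 - np.tanh(x) ** 2

def relu_grad(x):
    """Derivative of ReLU: stays at 1 for every positive input."""
    return (x > 0).astype(np.float64)

x = np.array([0.5, 2.0, 5.0])   # illustrative pre-activation values
sat = tanh_grad(x)              # decays rapidly with input magnitude
non_sat = relu_grad(x)          # constant for positive inputs
```

For the saturating function, the local gradient at an input of 5 is already below 0.001, whereas the non-saturating gradient remains 1, so far less learning signal passes through saturated units.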
...and 1 more figure
