Loss shaping enhances exact gradient learning with Eventprop in spiking neural networks

Thomas Nowotny; James P. Turner; James C. Knight

TL;DR

This work tackles the challenge of training spiking neural networks with exact gradients by extending the Eventprop algorithm to a broader class of loss functions, addressing the spike-deletion issue that limited SHD learning. Through loss shaping (L_F, including L_sum, L_sum_exp, and L_time) and targeted augmentations, the authors achieve strong SHD results (up to 93.5±0.7% test accuracy with LOSO cross-validation) and competitive SSC performance (74.1±0.9% test accuracy). The approach leverages a GeNN-based implementation to enable efficient forward and backward passes that scale with the number of spikes rather than timesteps, yielding significant speedups (≈3×) and memory reductions (≈4×) versus BPTT surrogates. The study demonstrates the practical viability of exact-gradient SNNs for keyword recognition tasks on neuromorphic hardware-relevant benchmarks, and outlines future directions toward deeper networks, learning delays, and more biologically plausible neuron models. The findings highlight loss-function design as a critical ingredient for successful exact-gradient learning in SNNs and pave the way for energy-efficient neuromorphic AI with scalable training pipelines.

Abstract

Event-based machine learning promises more energy-efficient AI on future neuromorphic hardware. Here, we investigate how the recently discovered Eventprop algorithm for gradient descent on exact gradients in spiking neural networks can be scaled up to challenging keyword recognition benchmarks. We implemented Eventprop in the GPU-enhanced Neural Networks (GeNN) framework and used it for training recurrent spiking neural networks on the Spiking Heidelberg Digits and Spiking Speech Commands datasets. We found that learning depended strongly on the loss function and extended Eventprop to a wider class of loss functions to enable effective training. We then tested a large number of data augmentations and regularisations and explored different network structures as well as heterogeneous and trainable timescales. We found that when combined with two specific augmentations, the right regularisation and a delay line input, Eventprop networks with one recurrent layer achieved state-of-the-art performance on Spiking Heidelberg Digits and good accuracy on Spiking Speech Commands. In comparison to a leading surrogate-gradient-based SNN training method, our GeNN Eventprop implementation is 3× faster and uses 4× less memory. This work is a significant step towards a low-power neuromorphic alternative to current machine learning paradigms.

Paper Structure (26 sections, 36 equations, 10 figures, 6 tables)

Introduction
Results
Additional loss functions in Eventprop
Spiking Heidelberg Digits
Applying machine learning tools for better accuracy
Spiking Speech Commands
Benchmarking against back-propagation through time and e-prop
Discussion
Materials and Methods
Phantom spike regularisation
Regularisation in the hidden layer
Dropout and Noise
Augmentations
Silent neurons
Learning rate ease-in
...and 11 more sections

Figures (10)

Figure 1: Relationship of spike times in LIF neurons and weights of incoming synapses. (A) Voltage $V$ of a LIF neuron as a function of time in response to a single incoming spike through a synapse of weight $w$. The higher $w$, the earlier the spike threshold is crossed and a spike is emitted (red dots). For $w=6$ the threshold is never crossed. (B) Time of threshold crossing as a function of the incoming weight in the scenario shown in A. The red dots match those in panel A. With decreasing $w$ the spike time increases continuously but then stops abruptly at a critical weight value when the spike threshold can no longer be reached. Crucially, the slope of the curve is finite before this point (see inset), so the gradient gives no indication that the critical point exists.
Figure 2: Gedankenexperiment illustrating the problem of accidental gradient ascent. (A) Minimal network for a two-class classification problem with one hidden layer. (B) Original assumed scenario of spikes in hidden neurons, the resulting output voltages $V_0, V_1$ and the corresponding loss term $l_V(V_0,V_1)$. (C) Fictitious scenario of spread-out spikes in hidden neuron $0$ that would have a lower loss. (D) Actual outcome of spreading out spikes by lowering $w_a$ with accidentally deleted spikes and hence higher loss than at the beginning. (E)-(H) like (A)-(D) but considering the spikes in hidden neuron $1$ where the gradient points to moving spikes closer/forward in time in hidden neuron $1$, leading to increases in $w_b$ and hence detrimental accidental creation of spikes.
Figure 3: Illustration of the mechanism that leads to unhelpful spike deletions in hidden neurons. (A) Spike raster of a typical input pattern of class 0. (B) Spike raster of the hidden layer in response to an input of class 0 (showing a subset of $30$ of $256$ neurons for better visibility). Red highlighted neurons are those that are most active on average for class 0 inputs and correspond to the panels shown in E. (C) $\lambda_V$ (orange), $\lambda_I$ (blue) of output neuron $0$ in the corresponding backwards pass plotted against forward time, i.e. integration proceeds from the right to the left. During backwards integration, $\lambda_V$ increases rapidly from 0 to the value corresponding to all output voltages being $0$ and $\lambda_I$ follows $\lambda_V$ (around $t=1400$, not shown). When the stored spikes are encountered, $\lambda_V$, $\lambda_I$ increase further as the model is not yet trained and the correct output voltage does not dominate in the response. (D) The difference $\lambda_V-\lambda_I$ of output $0$ that is transported to the neurons in the hidden layer. (E) $\lambda_V-\lambda_I$ values arriving at the four most active neurons (marked in red in B) when transported during a stored spike, shown as bars. The numbers indicate the sum of all bars, which relates to the direction of the total change in excitation the hidden neurons receive. All values are negative, i.e. neurons with positive weights towards the correct output $0$ will become less activated for this and similar inputs of class $0$ after the learning update and hidden neurons with negative weight will become more active -- exactly opposite to what one would expect for efficient learning. (F) Distributions of weights from hidden neurons onto neuron $0$ after $30$ epochs of training on class $0$. (G) Average firing rate of hidden neurons in response to inputs of class $0$ during the last mini-batch of the same training. Neurons are in the same order in F and G (sorted by their weight onto output $0$).
Figure 4: Summary of initial SHD classification results with a simple network, including regularisation only. (A) Learning curves for training (green) and testing accuracy (red). "ffwd" are feed-forward networks, "recur" recurrent networks. Curves are the mean of $8$ repeated runs with different random seeds and shaded areas indicate one standard deviation around the mean. The black arrows indicate the location of the best-achieved training accuracy and point to the values summarised in panels B and C. The grey arrows mark the highest test accuracy. (B) Average accuracies in feed-forward networks at the epoch with the best validation error for cross-validation and at the epoch with the best training error for train/test (black arrows in A). Values are the average across 10 folds in leave-one-speaker-out cross-validation and the average across 8 independent runs for train/test. Error bars are the corresponding standard deviations. (C) As B but for recurrent networks. The results for the failing ${\cal L}_{\text{x-entropy}}$ loss were omitted from this figure to avoid clutter.
Figure 5: Ablation study on the SHD dataset. (A) Accuracy on the test set as mean (line) and standard deviation (error bars) of 8 independent runs with different random number seeds. The panels are for different combinations of homogeneous and heterogeneous initialisation of $\tau_{\text{mem}}$ and $\tau_{\text{syn}}$ and for static or trained $\tau$ values as indicated. The different coloured lines correspond to the different augmentations applied as shown. (B) Wall clock time per sample during training as a function of test accuracy for all the different conditions as indicated by the symbols and colours. These data include runs with $\Delta t = 1$ ms and $\Delta t = 2$ ms. (C) Number of parameters, including $\tau$ values where trained, of the different networks as a function of the final test accuracy. Both B and C use the mean accuracy over 8 independent runs as in A.
...and 5 more figures
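
The relationship described in the Figure 1 caption (spike time as a function of a single incoming weight, with the spike vanishing below a critical weight) can be illustrated with a small Python sketch. It assumes a current-based LIF neuron with an exponential synapse, $\tau_{\text{mem}}\dot V = -V + I$ and $\tau_{\text{syn}}\dot I = -I$, where one input spike at $t=0$ sets $I$ to $w$. The time constants, the threshold value and all function names are illustrative assumptions and are not taken from the paper.

```python
import math

# Illustrative parameters (assumptions, not values from the paper).
TAU_MEM = 20.0  # membrane time constant in ms
TAU_SYN = 5.0   # synaptic time constant in ms
THETA = 1.0     # spike threshold


def voltage(t, w):
    """Closed-form V(t) after one input spike of weight w at t = 0, V(0) = 0."""
    scale = w * TAU_SYN / (TAU_MEM - TAU_SYN)
    return scale * (math.exp(-t / TAU_MEM) - math.exp(-t / TAU_SYN))


def peak_time():
    """Time at which V(t) is maximal; it does not depend on w."""
    return math.log(TAU_MEM / TAU_SYN) * TAU_MEM * TAU_SYN / (TAU_MEM - TAU_SYN)


def critical_weight():
    """Smallest weight for which the voltage peak just reaches the threshold."""
    return THETA / voltage(peak_time(), 1.0)


def spike_time(w):
    """First threshold crossing, or None if the threshold is never reached."""
    t_hi = peak_time()
    if voltage(t_hi, w) < THETA:
        return None  # the spike has been "deleted"
    # V(t) rises monotonically on [0, peak_time()], so bisection is safe.
    t_lo = 0.0
    for _ in range(60):
        t_mid = 0.5 * (t_lo + t_hi)
        if voltage(t_mid, w) < THETA:
            t_lo = t_mid
        else:
            t_hi = t_mid
    return t_hi


if __name__ == "__main__":
    w_c = critical_weight()
    print(f"critical weight: {w_c:.4f}")
    for factor in (0.9, 1.01, 1.1, 1.5, 2.0):
        print(f"w = {factor:.2f} * w_c -> spike time: {spike_time(factor * w_c)}")
```

Under these assumptions, the spike time grows smoothly as $w$ decreases and then disappears once $w$ falls below the critical weight, which is the discontinuity that a gradient with respect to the spike time cannot anticipate.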
