Directly Training Temporal Spiking Neural Network with Sparse Surrogate Gradient

Yang Li, Feifei Zhao, Dongcheng Zhao, Yi Zeng

TL;DR

This paper analyzes the problems of directly training SNNs with surrogate gradients (SGs) and proposes Masked Surrogate Gradients (MSGs), which balance training effectiveness against gradient sparsity and thereby improve the generalization ability of SNNs.

Abstract

Brain-inspired Spiking Neural Networks (SNNs) have attracted much attention due to their event-based computing and energy-efficient features. However, the all-or-none nature of spikes has prevented direct training of SNNs for various applications. The surrogate gradient (SG) algorithm has recently enabled spiking neural networks to shine in neuromorphic hardware. However, introducing surrogate gradients causes SNNs to lose their original sparsity, leading to potential performance loss. In this paper, we first analyze the current problems of direct training using SGs and then propose Masked Surrogate Gradients (MSGs) to balance the effectiveness of training and the sparseness of the gradient, thereby improving the generalization ability of SNNs. Moreover, we introduce a temporally weighted output (TWO) method to decode the network output, reinforcing the importance of correct timesteps. Extensive experiments on diverse network structures and datasets show that training with MSG and TWO surpasses the SOTA technique.
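
To make the idea of a masked surrogate gradient concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes an arctan-shaped surrogate (the figure captions mention an Arctan gradient) and assumes that masking simply zeroes each neuron's surrogate gradient independently with some probability, with no rescaling of the surviving entries. The function names, `mask_prob`, `alpha`, and `threshold` are all hypothetical names chosen for illustration.

```python
import math
import random


def heaviside_spike(v, threshold=1.0):
    """Forward pass: all-or-none spike when the membrane potential reaches threshold."""
    return [1.0 if x >= threshold else 0.0 for x in v]


def arctan_surrogate_grad(v, threshold=1.0, alpha=2.0):
    """Derivative of the smooth surrogate g(x) = arctan(pi/2 * alpha * x) / pi + 1/2,
    evaluated at x = v - threshold. Larger alpha gives a narrower, taller peak."""
    return [
        (alpha / 2.0) / (1.0 + (math.pi / 2.0 * alpha * (x - threshold)) ** 2)
        for x in v
    ]


def masked_surrogate_grad(v, mask_prob, rng, threshold=1.0, alpha=2.0):
    """ASSUMPTION: each neuron's surrogate gradient is zeroed independently
    with probability mask_prob; surviving entries are left unscaled."""
    grads = arctan_surrogate_grad(v, threshold, alpha)
    return [0.0 if rng.random() < mask_prob else g for g in grads]


rng = random.Random(0)  # fixed seed so the mask is reproducible
potentials = [0.2, 1.0, 1.7, -0.5]
spikes = heaviside_spike(potentials)
dense = arctan_surrogate_grad(potentials)
sparse = masked_surrogate_grad(potentials, mask_prob=0.5, rng=rng)
```

In this sketch the dense surrogate assigns a nonzero gradient to every neuron, including those far from threshold, while the masked version drops a random subset of them. Setting `mask_prob` to 0 recovers the plain surrogate gradient, and setting it to 1 blocks all gradient flow through the spike function.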

Paper Structure (19 sections, 12 equations, 6 figures, 2 tables, 1 algorithm)

Figures (6)

  • Figure 1: Extreme surrogate gradients can result in gradient mismatch or vanishing issues. For inputs sharing a consistent membrane potential distribution, the gradient histogram shows two polar outcomes depending on the surrogate width: with an excessively wide surrogate, a significant portion of parameters are updated; with an excessively narrow one, only a minimal set of parameters is updated.
  • Figure 2: The overall workflow of the proposed methods. MSG generates a random mask before calculating the gradient and then uses a smooth SG to assist training; the mask, in turn, provides stronger sparsity to mitigate the interference of surrogate gradients in training. TWO updates the weighting factors based on the historical output correctness of each timestep, reinforcing the importance of timesteps whose output alone yields a correct classification.
  • Figure 3: The temporal importance factor varies with epoch. Although an identical factor is assigned to every timestep at the onset, training reveals a clear trend: accuracy at the initial timesteps is frequently lower than at subsequent timesteps. Consequently, the temporal importance factor eventually stabilizes at a revised value, improving the output decoding of the SNN.
  • Figure 4: MSG and TWO help improve performance. We provide (A) the test accuracy and (B) the training loss curve on DVS-CIFAR10 with ResNet18. MSG helps the network escape local minima, and TWO improves the ability to decode the output.
  • Figure 5: The effect of mask probability. We test the classification performance of SNNs with different network structures and surrogate gradients on DVS-CIFAR10 under different mask probabilities. (A) ResNet18 with Arctan gradient. (B) VGGNet with Arctan gradient. (C) VGGNet with PiecewiseLinear gradient.
  • ...and 1 more figures
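
The captions for Figures 2 and 3 describe TWO as starting from identical per-timestep factors and then adjusting them according to how often each timestep alone classifies correctly. The following pure-Python sketch illustrates one way such a scheme could work. The update rule here (an exponential moving average of per-timestep accuracy, normalized to sum to 1) is an assumption made for illustration, not the rule from the paper, and the class name, `momentum`, and method names are hypothetical.

```python
def argmax(xs):
    """Index of the largest element."""
    return max(range(len(xs)), key=xs.__getitem__)


class TemporallyWeightedOutput:
    def __init__(self, num_steps, momentum=0.9):
        self.momentum = momentum
        self.scores = [1.0] * num_steps                # running correctness per timestep
        self.weights = [1.0 / num_steps] * num_steps   # identical factors at the onset

    def decode(self, outputs):
        """outputs[t][c]: network output at timestep t for class c (one sample).
        Returns the weighted sum over timesteps for each class."""
        num_classes = len(outputs[0])
        return [
            sum(w * step[c] for w, step in zip(self.weights, outputs))
            for c in range(num_classes)
        ]

    def update(self, batch_outputs, labels):
        """ASSUMPTION: score each timestep by how often its output alone predicts
        the label, smooth with an EMA, then normalize the scores into weights."""
        for t in range(len(self.weights)):
            hits = sum(
                1 for sample, y in zip(batch_outputs, labels)
                if argmax(sample[t]) == y
            )
            acc = hits / len(labels)
            self.scores[t] = self.momentum * self.scores[t] + (1.0 - self.momentum) * acc
        total = sum(self.scores)
        if total > 0:  # keep the previous weights if every score has decayed to zero
            self.weights = [s / total for s in self.scores]


# Two samples, two timesteps, two classes: timestep 0 is wrong, timestep 1 is right.
batch = [[[0.9, 0.1], [0.2, 0.8]], [[0.6, 0.4], [0.1, 0.9]]]
labels = [1, 1]
two = TemporallyWeightedOutput(num_steps=2, momentum=0.5)
before = argmax(two.decode(batch[0]))
two.update(batch, labels)
after = argmax(two.decode(batch[0]))
```

In this toy batch the uniform weighting decodes the first sample as class 0, because the early, incorrect timestep dominates. After one update the later timestep carries more weight and the same sample decodes as class 1, which mirrors the trend described in the Figure 3 caption.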