Towards Scalable GPU-Accelerated SNN Training via Temporal Fusion

Yanchen Li, Jiachun Li, Kebin Sun, Luziwei Leng, Ran Cheng

TL;DR

This work addresses the slow GPU training of Spiking Neural Networks (SNNs), a cost that stems from their temporal dynamics. It introduces temporal fusion, which decouples and fuses the propagation of Leaky Integrate-and-Fire (LIF) neurons so that each layer is processed across all time steps on a single GPU, and extends the scheme to multiple GPUs through pipeline parallelism. The authors present a CUDA-based implementation integrated with PyTorch, derive a theoretical speedup model, and report 5×–40× accelerations on static and event-based benchmarks while preserving accuracy. They further analyze time-step scalability and multi-GPU performance, showing that the benefit grows with temporal depth and that the optimal GPU count is near $\sqrt{T_s/T_c}$. The approach targets scalable SNN training on conventional GPUs, supporting longer temporal horizons and bringing SNN research closer to practical deployment.
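
To give some intuition for why an optimum of the form $\sqrt{T_s/T_c}$ can arise, here is a toy cost model. It is an assumption made for illustration and is not the paper's speedup equation: `t_s` is read as the total time-step workload that can be divided among GPUs, and `t_c` as a fixed overhead added per GPU. Under that reading the iteration time is $T_s/N + T_c N$, which is minimized at $N = \sqrt{T_s/T_c}$. The function names are hypothetical.

```python
import math


def iteration_time(n_gpus, t_s, t_c):
    """Toy cost model (assumed, not taken from the paper).

    t_s: workload that is split evenly across the GPUs.
    t_c: overhead that grows linearly with the number of GPUs.
    """
    return t_s / n_gpus + t_c * n_gpus


def best_gpu_count(t_s, t_c, max_gpus):
    """Brute-force search for the GPU count with the lowest modeled time."""
    return min(range(1, max_gpus + 1),
               key=lambda n: iteration_time(n, t_s, t_c))


# With t_s = 64 and t_c = 1 the closed form gives sqrt(64 / 1) = 8,
# and the brute-force search over 1..16 GPUs agrees.
closed_form = math.sqrt(64.0 / 1.0)
searched = best_gpu_count(64.0, 1.0, 16)
```

Under this model, adding GPUs beyond the optimum makes the overhead term dominate, so the modeled time rises again.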

Abstract

Drawing on the intricate structures of the brain, Spiking Neural Networks (SNNs) emerge as a transformative development in artificial intelligence, closely emulating the complex dynamics of biological neural networks. While SNNs show promising efficiency on specialized sparse-computational hardware, their practical training often relies on conventional GPUs. This reliance frequently leads to extended computation times when contrasted with traditional Artificial Neural Networks (ANNs), presenting significant hurdles for advancing SNN research. To navigate this challenge, we present a novel temporal fusion method, specifically designed to expedite the propagation dynamics of SNNs on GPU platforms, which serves as an enhancement to the current significant approaches for handling deep learning tasks with SNNs. This method underwent thorough validation through extensive experiments in both authentic training scenarios and idealized conditions, confirming its efficacy and adaptability for single and multi-GPU systems. Benchmarked against various existing SNN libraries/implementations, our method achieved accelerations ranging from $5\times$ to $40\times$ on NVIDIA A100 GPUs. Publicly available experimental codes can be found at https://github.com/EMI-Group/snn-temporal-fusion.

Paper Structure (16 sections, 6 equations, 6 figures, 1 table)

Figures (6)

  • Figure 1: Schematics of temporal fusion in single-GPU and multi-GPU environments. The horizontal axis represents time steps, while the vertical axis represents layer-by-layer propagation through the network. The temporal dimension is unfolded, with each square representing the neurons of one SNN layer at a given time step.
  • Figure 2: Forward (left) and backward (right) propagation in a monolayer LIF network via temporal fusion. The horizontal axis indicates time-step-wise propagation, and the vertical axis shows layer-by-layer progression, conforming to Eqs. \ref{eq:LIF}, \ref{eq:spike} and \ref{eq:backLIF}. The red-shaded area delineates the operator fusion range within the GPU kernel, merging $x_i^{(t)}$ and $x_i^{(t+1)}$ (as well as $\nabla_{y_i^{(t)}}L$ and $\nabla_{y_i^{(t-1)}}L$) for integrated GPU kernel processing, thereby minimizing the memory access overhead.
  • Figure 3: A comparative analysis of the traditional serial method versus the temporal fusion method in SNN training. "Compute" refers to GPU kernel computations, while "read" and "write" refer to memory operations. $\boldsymbol{X}^{(t)}$, $\boldsymbol{V}^{(t)}$, and $\boldsymbol{Y}^{(t)}$ denote the tensor representations of neuronal input, membrane potential, and output, respectively, corresponding to the element-wise variables $x_i^{(t)}$ and $v_i^{(t)}$ in Eq. \ref{eq:LIF} and $y_i^{(t)}$ in Eq. \ref{eq:spike}.
  • Figure 4: The simulated relationship between the acceleration ratio and the GPU count under three task conditions, as per Eq. \ref{eq:rate}.
  • Figure 5: Performance comparison of multiple methods for training a monolayer LIF network on a single GPU at different time steps. The * symbols mark the self-implemented baselines. "Serial (PyTorch)" and "Serial (CUDA)" are the PyTorch-based and CUDA-based implementations of the "Serial Training" method (see Fig. \ref{fig:fused_analysis}), respectively.
  • ...and 1 more figures
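
The contrast between serial training and temporal fusion described in the captions of Figures 2 and 3 can be sketched in plain Python. This is a minimal illustration, not the authors' CUDA kernel. The LIF update used here is a common textbook form with a hard reset, assumed because the paper's equations are not reproduced in this summary, and the constants `TAU`, `V_TH`, and `V_RESET` as well as all function names are hypothetical. Lists stand in for GPU global memory, a local variable stands in for a register, and the `launches` counter stands in for kernel launches.

```python
# Illustrative sketch only: an assumed textbook LIF update, not the paper's equations.

TAU = 2.0      # membrane time constant (assumed value)
V_TH = 1.0     # firing threshold (assumed value)
V_RESET = 0.0  # hard-reset potential (assumed value)


def lif_step(v_prev, x):
    """One LIF update for one neuron: integrate, fire, hard reset."""
    v = v_prev + (x - (v_prev - V_RESET)) / TAU
    y = 1.0 if v >= V_TH else 0.0
    v_next = V_RESET if y else v
    return v_next, y


def serial_forward(X):
    """Time-step-wise propagation: one simulated launch per time step.

    X is a list of T rows; row t holds the inputs of all neurons at step t.
    The membrane state V is written out and read back at every step.
    """
    n = len(X[0])
    V = [V_RESET] * n
    Y = []
    launches = 0
    for x_t in X:
        launches += 1
        out = [lif_step(V[i], x_t[i]) for i in range(n)]
        V = [v for v, _ in out]        # write membrane state back
        Y.append([y for _, y in out])  # write spikes for this step
    return Y, launches


def fused_forward(X):
    """Fused propagation: a single simulated launch in which each neuron
    (playing the role of one GPU thread) loops over all time steps and
    keeps its membrane potential in a local variable."""
    T, n = len(X), len(X[0])
    Y = [[0.0] * n for _ in range(T)]
    launches = 1
    for i in range(n):
        v = V_RESET
        for t in range(T):
            v, Y[t][i] = lif_step(v, X[t][i])
    return Y, launches


# Three time steps, two neurons.
X = [[1.5, 0.2], [1.5, 0.2], [1.5, 3.0]]
Y_serial, n_serial = serial_forward(X)
Y_fused, n_fused = fused_forward(X)
```

Both functions produce the same spike trains; the fused variant does so with one simulated launch instead of one per time step and without writing the intermediate membrane potentials back between steps, which is the source of the reduced memory traffic that the Figure 2 caption refers to.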