EventQueues: Autodifferentiable spike event queues for brain simulation on AI accelerators
Lennart P. L. Landsmeer, Amirreza Movahedin, Said Hamdioui, Christos Strydis
TL;DR
This work tackles efficient gradient-based training of spiking neural networks with delayed spike delivery on diverse AI accelerators. It derives generalized, autodifferentiable spike event queues and implements multiple data structures with custom gradients, enabling exact gradient propagation through delayed spikes. Through cross-platform benchmarks (CPU, GPU, TPU, Groq LPU), it demonstrates architecture-specific preferences (heaps on von Neumann, sorting on dataflow, ring buffers on GPUs) and highlights trade-offs involving memory capacity and spike dropping. The findings inform both algorithmic design and hardware development, suggesting that future autograd frameworks and specialized delay-enabled hardware can further improve performance and scalability of gradient-enabled SNNs.
Abstract
Spiking neural networks (SNNs), central to computational neuroscience and neuromorphic machine learning (ML), require efficient simulation and gradient-based training. While AI accelerators offer promising speedups, gradient-based SNNs typically represent sparse spike events with dense, memory-heavy data structures. Existing exact gradient methods lack generality, and current simulators often omit delayed spikes or handle them inefficiently. We address this by deriving gradient computation through spike event queues, including delays, and by implementing memory-efficient, gradient-enabled event queue structures. These are benchmarked across CPU, GPU, TPU, and LPU platforms. We find that queue design strongly shapes performance. CPUs, as expected, perform well with traditional tree-based or FIFO implementations, while GPUs excel with ring buffers for smaller simulations but prefer sparser data structures under higher memory pressure. TPUs seem to favor an implementation based on sorting intrinsics. Selective spike dropping provides a simple performance-accuracy trade-off, which could be enhanced by future autograd frameworks that allow primal and tangent data structures to diverge.
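To make the idea of a gradient-carrying spike event queue concrete, the following is a minimal sketch in plain Python and not the implementation benchmarked in this work. It assumes that a spike's arrival time is simply its emission time plus a fixed delay, so the derivatives of the arrival time with respect to both quantities equal 1 and can travel with the event. The names `Event`, `DelayQueue`, `push`, `pop_due`, and the drop-when-full policy are hypothetical and chosen for illustration only.

```python
import heapq
from dataclasses import dataclass, field


@dataclass(order=True)
class Event:
    # Primal value: time at which the spike reaches its target.
    arrival: float
    # Hypothetical target neuron index; excluded from heap ordering.
    target: int = field(compare=False)
    # Tangents under the assumption arrival = spike_time + delay.
    d_spike: float = field(compare=False, default=1.0)  # d(arrival)/d(spike_time)
    d_delay: float = field(compare=False, default=1.0)  # d(arrival)/d(delay)


class DelayQueue:
    """Min-heap of pending spikes with an optional capacity limit."""

    def __init__(self, capacity=None):
        self.heap = []
        self.capacity = capacity
        self.dropped = 0  # number of spikes discarded because the queue was full

    def push(self, spike_time, delay, target):
        # Assumed policy: when the queue is full, the incoming spike is dropped.
        if self.capacity is not None and len(self.heap) >= self.capacity:
            self.dropped += 1
            return False
        heapq.heappush(self.heap, Event(spike_time + delay, target))
        return True

    def pop_due(self, now):
        # Return all events whose arrival time has been reached, earliest first.
        due = []
        while self.heap and self.heap[0].arrival <= now:
            due.append(heapq.heappop(self.heap))
        return due


queue = DelayQueue(capacity=2)
queue.push(1.0, 2.0, target=0)   # arrives at t = 3.0
queue.push(0.5, 1.0, target=1)   # arrives at t = 1.5
queue.push(0.7, 0.1, target=2)   # queue is full, so this spike is dropped
delivered = queue.pop_due(now=2.0)
```

In this sketch a downstream gradient with respect to an event's arrival time can be chained to the emitting neuron's spike time or to the delay by multiplying with `d_spike` or `d_delay`. The capacity limit illustrates the performance-accuracy trade-off of spike dropping: a dropped spike contributes neither to the forward simulation nor to the gradient.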
