Scalable Construction of Spiking Neural Networks using up to thousands of GPUs

Bruno Golosio, Gianmarco Tiddia, José Villamar, Luca Pontisso, Luca Sergi, Francesco Simula, Pooja Babu, Elena Pastorelli, Abigail Morrison, Markus Diesmann, Alessandro Lonardo, Pier Stanislao Paolucci, Johanna Senk

TL;DR

The study tackles the challenge of simulating large-scale spiking neural networks on GPU clusters by redesigning network construction and spike communication for extreme parallelism. It introduces an onboard, memory-efficient GPU-based network-construction workflow that builds local connectivity and prepares MPI communication structures entirely on each GPU, using proxy image neurons to route remote spikes. The work analyzes two MPI-spike-delivery schemes (point-to-point and collective) and four optimization levels to trade GPU memory usage against time-to-solution, demonstrating strong and weak scaling on the Multi-Area Model and scalable balanced networks across thousands of GPUs. The results show substantial speedups in network construction, viable memory footprints for exascale-like machines, and practical guidance for structure-aware mapping and hybrid communication strategies, with code released as NEST GPU 2.0.

Abstract

Diverse scientific and engineering research areas deal with discrete, time-stamped changes in large systems of interacting delay differential equations. Simulating such complex systems at scale on high-performance computing clusters demands efficient management of communication and memory. Inspired by the human cerebral cortex -- a sparsely connected network of $\mathcal{O}(10^{10})$ neurons, each forming $\mathcal{O}(10^{3})$--$\mathcal{O}(10^{4})$ synapses and communicating via short electrical pulses called spikes -- we study the simulation of large-scale spiking neural networks for computational neuroscience research. This work presents a novel network construction method for multi-GPU clusters and upcoming exascale supercomputers using the Message Passing Interface (MPI), where each process builds its local connectivity and prepares the data structures for efficient spike exchange across the cluster during state propagation. We demonstrate scaling performance of two cortical models using point-to-point and collective communication, respectively.

Paper Structure

This paper contains 6 sections, 21 equations, 13 figures, 1 table.

Figures (13)

  • Figure 1: Communication schemes for spike routing and delivery with one MPI process per GPU. Each process is identified by rank and color ($0$: blue, $1$: green, $2$: yellow). After neuron creation, each rank contains an arbitrary number of neurons, each with a unique index in the rank-local neuron array. (a) Point-to-point communication scheme. Remote connections (red dashed arrows) require MPI communication. Spikes emitted by neuron $1$ of rank $0$ and neurons $0$ and $2$ of rank $2$ are routed to their proxies in the target ranks, listed in array $\mathbf{T}$, by sending through MPI the corresponding positions $\mathbf{P}$ in the maps $(\mathbf{R}_{\tau, \sigma}, \mathbf{L}_{\tau, \sigma})$ that associate the index of the source neuron in rank $\sigma$ with the index of its proxy in the target rank $\tau$. $\mathbf{S}_{\tau, \sigma}$ in the source process corresponds to $\mathbf{R}_{\tau, \sigma}$ in the target process. (b) Collective communication scheme. All ranks belonging to an MPI group use specific indexing arrays; here, group $0$ uses arrays denoted with suffix $0$. Neurons of rank $\sigma$ that have a proxy in any other rank of the group are indexed in the host array $\mathbf{H}_{0, \sigma}$. Spikes emitted by neuron $1$ of rank $0$ and neurons $0$ and $2$ of rank $2$ are first routed to the groups to which they have remote connections using the group array $\mathbf{G}$ paired with their position $\mathbf{Q}_0$ in the host array (in this illustration, only one group exists; however, simultaneous participation in multiple groups is possible). Spikes are then communicated to all members of the group and, once received, are routed to the proxy neurons using the rank-specific image index array $\mathbf{I}_{0, \tau, \sigma}$. For both point-to-point and collective communication schemes, within each rank, remote connections from image neurons serve as the final link to local neurons receiving spikes.
  • Figure 2: Comparison of the performance of the offboard and onboard versions in simulating the MAM in the metastable state. Simulations are performed on $32$ nodes of JUSUF (one NVIDIA V100 GPU per node). Data are from $10$ simulations using different random seeds. (a) Bar plot of the network construction time divided into its subtasks, which are shown in chronological order from bottom to top: 1) simulator initialization, 2) neuron and device creation, 3) local connection generation, 4) remote connection generation, and 5) the organization of data structures for spike delivery (simulation preparation). Bar heights represent mean values, and black error bars represent standard deviations. (b) Box plot of state propagation time measured as the real-time factor. The central line represents the median, whereas the box indicates the interquartile range (IQR). The whiskers extend up to $1.5\times$IQR, and data exceeding this value are represented as outliers. Red stars indicate the mean of the distributions.
  • Figure 3: Network construction (a) and state propagation (b) times of the scalable balanced network model for the four optimization levels as a function of the number of cluster nodes (four GPUs per node). State propagation is measured as the real-time factor. Optimization level $3$ is also reported with spike recording disabled. The network construction curves for optimization levels $2$ and $3$ overlap.
  • Figure 4: Peak memory usage per GPU for the scalable balanced network model simulation as a function of the number of cluster nodes for the four optimization levels. The upper horizontal axis shows the total number of synapses of the model, indicating the increasing network size. Square markers and dashed lines represent the estimation of the memory peak obtained using four MPI processes on a single node. Dot markers and continuous lines represent simulated configurations, averaged over five simulations using different random seeds. Error bars represent standard deviations. An inset of the measured data at the bottom right of the figure facilitates comparison with the estimated values. The dash-dotted horizontal line represents the memory limit of one NVIDIA A100 GPU. For these configurations, the first two levels of optimization show compatible results, and the points overlap.
  • Figure 5: Network construction time of the scalable balanced network model, evaluated across all MPI processes as a function of the number of cluster nodes for the three optimization levels. The time is divided into (a) neuron creation and connection (the neuron and device creation and the local and remote connection generation subtasks) and (b) simulation preparation (the organization of data structures for spike delivery subtask). The bars represent an estimation using four MPI processes on a single node. The horizontal line markers show the time averaged across the MPI processes using the optimization levels.
  • ...and 8 more figures
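
The point-to-point maps described in the caption of Figure 1(a) can be illustrated with a toy, single-process Python sketch. It is not the NEST GPU implementation: no MPI or GPU code is involved, all function and variable names are hypothetical, and placing image neurons directly after the local neurons in the rank-local array is an assumption made only for this example.

```python
# Toy model of the point-to-point maps from Figure 1(a).
# Names and index layout are illustrative assumptions, not NEST GPU's API.

def build_target_maps(sources_by_rank, n_local_neurons):
    """On a target rank tau: for each source rank sigma, store the sorted
    source-neuron indices R[sigma] and the matching image-neuron indices
    L[sigma]. Image neurons are assumed to follow the local neurons."""
    R, L = {}, {}
    next_image = n_local_neurons
    for sigma in sorted(sources_by_rank):
        R[sigma] = sorted(set(sources_by_rank[sigma]))
        L[sigma] = list(range(next_image, next_image + len(R[sigma])))
        next_image += len(R[sigma])
    return R, L

def positions_to_send(spiking, S_tau):
    """On the source rank sigma: translate spiking neuron indices into their
    positions P in S[tau], the source-side copy of R[tau, sigma]."""
    pos = {src: p for p, src in enumerate(S_tau)}
    return [pos[i] for i in spiking if i in pos]

def deliver(positions, L_sigma):
    """On the target rank tau: map received positions to image neurons."""
    return [L_sigma[p] for p in positions]

# Target rank 1 owns 3 local neurons and has remote connections from
# neuron 1 of rank 0 and neurons 0 and 2 of rank 2.
R, L = build_target_maps({0: [1], 2: [2, 0]}, n_local_neurons=3)

# Rank 2 holds S identical to rank 1's R[2]; neuron 2 of rank 2 spikes.
S_on_rank2 = R[2]
sent = positions_to_send([2], S_on_rank2)   # only positions cross the network
images = deliver(sent, L[2])                # image neurons that receive spikes
```

In this sketch only the compact positions are exchanged between ranks, and the target rank resolves them to its own image neurons, which then deliver the spike to local targets through ordinary local connections.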