Exploiting network topology in brain-scale simulations of spiking neural networks

Melissa Lober, Markus Diesmann, Susanne Kunkel

TL;DR

This work proposes a local-global hybrid communication architecture for large-scale neuronal network simulations as a first step in mapping the structure of the brain to the structure of a supercomputer.

Abstract

Simulation code for conventional supercomputers serves as a reference for neuromorphic computing systems. The present bottleneck of distributed large-scale spiking neuronal network simulations is the communication between compute nodes. Communication speed seems limited by the interconnect between the nodes and the software library orchestrating the data transfer. Profiling reveals, however, that the variability of the time required by the compute nodes between communication calls is large. The bottleneck is in fact the waiting time for the slowest node. A statistical model explains total simulation time on the basis of the distribution of computation times between communication calls. A fundamental cure is to avoid communication calls because this requires fewer synchronizations and reduces the variability of computation times across compute nodes. The organization of the mammalian brain into areas lends itself to such an optimization strategy. Connections between neurons within an area have short delays, but the delays of the long-range connections across areas are an order of magnitude longer. This suggests a structure-aware mapping of areas to compute nodes allowing for a partition into more frequent communication between nodes simulating a particular area and less frequent global communication. We demonstrate a substantial performance gain on a real-world example. This work proposes a local-global hybrid communication architecture for large-scale neuronal network simulations as a first step in mapping the structure of the brain to the structure of a supercomputer. It challenges the long-standing belief that the bottleneck of simulation is synchronization inherent in the collective calls of standard communication libraries. We provide guidelines for the energy-efficient simulation of neuronal networks on conventional computing systems and raise the bar for neuromorphic systems.

Paper Structure (16 sections, 18 equations, 12 figures)

Figures (12)

  • Figure 1: Strong scaling of the multi-area model of macaque visual cortex (MAM) in the ground state. (a) Real time factors, defined as wall-clock time normalized by simulated model time, stacked for each phase of the simulation (legend). All simulations cover $T_\mathrm{model}=10\,\mathrm{s}$ of biological time. Error bars (at line resolution) indicate variability across three simulations using different random seeds. (b) Real time factor of the communication phase extracted from panel (a), including synchronization. The dashed curve marks the time attributed to pure MPI communication, as estimated from MPI benchmarks (see Figure 4) for a $T_\mathrm{model}=10\,\mathrm{s}$ simulation and average buffer sizes per target rank of $1408$, $837$, $514$, $317$ bytes reported by the simulations in (a) using $16$, $32$, $64$, $128$ MPI processes, respectively. All data obtained on SuperMUC-NG using NEST version $3.6$.
  • Figure 2: Conventional and structure-aware simulation scheme illustrated for an example multi-area model. Neurons of three areas (middle) are color coded in blue, red, and green. Synaptic transmission delays are significantly shorter within areas than between areas; here, the minimum delays are $0.1\,\text{ms}$ and $1.0\,\text{ms}$, respectively. The minimum delay between any pair of neurons represented on two different MPI processes dictates the communication interval for the exchange of spikes between the processes. In conventional simulation technology, neurons are distributed across MPI processes and threads according to a round-robin scheme (left) to balance workload. Therefore, network structure cannot be exploited, and global MPI communication is required every $0.1\,\text{ms}$. The structure-aware distribution scheme (right) maps areas to MPI processes, increasing the required interval for global communication to $1.0\,\text{ms}$.
  • Figure 3: Conventional and structure-aware simulation flow. Flow chart of the iteration over $S$ simulation cycles highlighting differences between the conventional simulation strategy (pink shaded arrow) and the structure-aware strategy (cyan shaded arrows). For every simulation cycle, each MPI process first delivers incoming spikes from the MPI receive buffer to its local target neurons (blue box). Second, it updates all process-local neurons (red box). Third, it collocates new spikes in the MPI send buffer (yellow box). In the conventional simulation scheme, every simulation cycle terminates with a global MPI communication of spikes (left green box). In the structure-aware scheme, most cycles terminate with a process-local exchange of spikes (right green box), and only every $D$-th cycle terminates with a global MPI communication.
  • Figure 4: MPI collective performance for increasing message sizes. Time required for a single MPI_Alltoall() call (average over $1000$ calls) as a function of buffer size when using the OpenMPI library on SuperMUC-NG, shown for increasing numbers of MPI processes as indicated in the legend. Dashed vertical lines indicate the typical buffer sizes per target rank for the MAM-benchmark, with $\sim 130{,}000$ neurons per MPI rank, simulated with the conventional (pink) or structure-aware (cyan) strategy.
  • Figure 5: Graphical intuition of the predicted reduction in overall synchronization time and thus overall runtime in a multi-area model simulation. Illustration of $S\,=\,10$ simulation cycles on $M\,=\,32$ MPI processes when using the structure-aware simulation strategy (bottom) instead of the conventional strategy (top). All timing data is artificial and generated for illustration purposes. Wall-clock time spent per simulation cycle is color coded: deliver (blue bars), update (red bars), and collocate (yellow bars). The exchange of spike data between MPI processes (communicate) is assumed to take up minimal time (green bars). The conventional strategy requires global communication after every simulation cycle. In the case of collective blocking MPI communication, this entails synchronization (gray bars): the process that takes longest for the cycle forces all other processes to wait. The structure-aware strategy requires the same wall-clock times per simulation cycle but reduces the number of global communications by a factor of $D$ (eq:d-min-ratio); here we assume $D\,=\,10$. This allows the $10$ cycles to be simulated without intermittent synchronization and thus levels out variations, such that the overall synchronization times are lower than the sums of the corresponding per-cycle synchronization times in the conventional case.
  • ...and 7 more figures
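
The intuition behind Figures 3 and 5 can be sketched in a few lines of Python. This is a toy illustration, not the authors' statistical model or NEST code: the function name `total_wallclock`, the uniform distribution of per-cycle compute times, and the fixed seed are all assumptions made for the example. As in Figure 5, communication itself is taken to cost nothing, and the process-local spike exchange of the structure-aware scheme is assumed to need no cross-process synchronization. The ratio $D$ is derived here from the two minimum delays quoted in Figure 2.

```python
import random


def total_wallclock(cycle_times, sync_interval):
    """Total wall-clock time when all processes synchronize every
    `sync_interval` cycles (hypothetical helper, for illustration only).

    cycle_times[s][m] is the time process m spends on deliver + update +
    collocate in cycle s. Communication itself is assumed to be free.
    """
    num_cycles = len(cycle_times)
    num_procs = len(cycle_times[0])
    since_sync = [0.0] * num_procs  # work accumulated per process since last sync
    total = 0.0
    for s in range(num_cycles):
        for m in range(num_procs):
            since_sync[m] += cycle_times[s][m]
        last_cycle = (s + 1 == num_cycles)
        if (s + 1) % sync_interval == 0 or last_cycle:
            # Blocking collective: every process waits for the slowest one.
            total += max(since_sync)
            since_sync = [0.0] * num_procs
    return total


rng = random.Random(42)  # fixed seed so the artificial timings are reproducible

S, M = 10, 32                      # cycles and MPI processes, as in Figure 5
d_intra_ms, d_inter_ms = 0.1, 1.0  # minimum delays within / between areas (Figure 2)
D = round(d_inter_ms / d_intra_ms) # cycles between global communications

# Artificial per-cycle compute times in arbitrary units (assumed distribution).
cycle_times = [[rng.uniform(0.5, 1.5) for _ in range(M)] for _ in range(S)]

conventional = total_wallclock(cycle_times, sync_interval=1)
structure_aware = total_wallclock(cycle_times, sync_interval=D)
balanced = sum(map(sum, cycle_times)) / M  # lower bound: perfectly even load

print(f"D = {D}")
print(f"conventional    (sync every cycle):    {conventional:.2f}")
print(f"structure-aware (sync every D cycles): {structure_aware:.2f}")
print(f"perfectly balanced lower bound:        {balanced:.2f}")
```

The conventional total is a sum of per-cycle maxima, whereas the structure-aware total is the maximum of per-process sums over $D$ cycles. The latter can never exceed the former, and the gap grows with the cycle-to-cycle variability of the compute times, which is the levelling-out effect the caption of Figure 5 describes.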