The BrainScaleS-2 multi-chip system: Interconnecting continuous-time neuromorphic compute substrates

Joscha Ilmberger, Johannes Schemmel

TL;DR

The paper tackles scaling continuous-time neuromorphic substrates beyond single chips by introducing an FPGA-based interconnect architecture for the BrainScaleS-2 system. It presents a routing-enabled multi-chip design with an Aggregator unit and per-chip Node-FPGAs that forms a star topology within a standard rack, achieving deterministic spike latencies and scalable connectivity. Key results include sub-1.3 µs backplane latencies, BER around 1e-15 at 5 Gbps links, and the ability to interconnect up to roughly 120 chips (≈61k neurons, ≈15M synapses) with modest additional latency. The work lays a scalable path toward large-scale analog SNNs and informs future systems with direct ASIC interconnects for higher density and efficiency, enabling large-scale training and experimentation in neuromorphic computing.

Abstract

The BrainScaleS-2 SoC integrates analog neuron and synapse circuits with digital periphery, including two CPUs with SIMD extensions. Each ASIC is connected to a Node-FPGA, providing experiment control and Ethernet connectivity. This work details the scaling of the compute substrate through FPGA-based interconnection via an additional Aggregator unit. The Aggregator provides up to 12 transceiver links to a backplane of Node-FPGAs, as well as 4 transceiver lanes for further extension. Two such interconnected backplanes are integrated into a standard 19 in rack case with 4U height together with an Ethernet switch, system controller and power supplies. For all spike rates, chip-to-chip latencies, consisting of four hops across three FPGAs, below 1.3 µs are achieved within each backplane.
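
The system sizes quoted on this page follow from simple per-chip arithmetic. The sketch below is only illustrative: the per-chip figures (512 neurons, 256 synapses per neuron) are assumptions not stated in the text above, chosen because they are consistent with the totals quoted here, and `system_size` is a hypothetical helper name.

```python
# Assumed per-chip figures (not stated in the text above).
NEURONS_PER_CHIP = 512
SYNAPSES_PER_NEURON = 256


def system_size(chips_per_backplane: int, backplanes: int) -> dict:
    """Return chip, neuron and synapse counts for a set of backplanes."""
    chips = chips_per_backplane * backplanes
    neurons = chips * NEURONS_PER_CHIP
    synapses = neurons * SYNAPSES_PER_NEURON
    return {"chips": chips, "neurons": neurons, "synapses": synapses}


# One 4U rack case: two backplanes with 12 Node-FPGA links each.
case = system_size(chips_per_backplane=12, backplanes=2)

# Roughly 120 chips, modeled here as ten backplanes of 12 chips.
large = system_size(chips_per_backplane=12, backplanes=10)

print(case)
print(large)
```

Under these assumptions, one rack case holds 24 chips with about 12 thousand neurons and 3 million synapses, and 120 chips give about 61 thousand neurons and 15 million synapses.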

Paper Structure

This paper contains 5 sections, 5 figures.

Figures (5)

  • Figure 1: Top: The BrainScaleS-2 (BSS-2) multi-chip system. Fully equipped, it will consist of two backplanes with 12 interconnected BSS-2 SoCs each, corresponding to a total of 12 thousand neurons and 3 million synapse circuits. Bottom left: Back panel view with integrated Ethernet switch, system controller and ATX power supply. Bottom right: Deployment of first systems at the European Institute for Neuromorphic Computing (EINC) at Heidelberg University. A second-layer interconnect between all backplanes inside a rack is envisioned.
  • Figure 2: Overview of the neuromorphic multi-chip system. One backplane connects up to 12 BSS-2 SoCs with one Node-FPGA each via adapter boards. The adapter board contains all necessary periphery of the neuromorphic SoC such as level shifters, LDO regulators and DACs. Network sizes beyond a single chip can be achieved by interconnecting the transceivers of all Node-FPGAs to an additional Aggregator unit. The star topology allows for symmetric delays below 1.3 µs from any source neuron to any off-chip target synapse. Together with off-the-shelf components such as an ATX power supply, Ethernet switch, and ARM-based system controller, up to two backplanes can be combined into an air-cooled 4U high 19 in rack case. This unit requires only mains power and Ethernet uplinks to operate.
  • Figure 3: Multi-chip extension to the existing Node-FPGA design and routing logic. Just before the real-time section of the experiment, a system synchronization barrier command is executed from the playback buffer, which sends a request via the MGT link to the Aggregator. Once the Aggregator has received the request from all participating Node-FPGAs, an external synchronization signal is toggled, causing the playback execution to continue. During the real-time section, output spikes of the BSS-2 SoC neurons leaving the layer-1 crossbar get a timestamp attached and are transported to the Node-FPGA via the layer-2 link. The multi-chip extension listens in on this traffic, discards the timestamp, unpacks the spikes and uses a block-RAM-based lookup for 15 bit labels and routing enable. Inside the Aggregator, spikes are broadcast in an all-to-all connectivity scheme with static enables for each route. Spikes can be sent and received every clock cycle, except during clock-compensation pauses of the transceiver. All spikes that are sent back to a Node-FPGA pass a reverse lookup to 16 bit BSS-2 ASIC spike labels and are packed. After attaching the lower eight bits of the current system time, which is synchronized with the ASIC, the spikes can be sent through the layer-2 link. With a transceiver user clock of 250 MHz, the maximum theoretical spike throughput of the BSS-2 ASIC can be sustained. The routing logic is the simplest possible implementation and should be seen as a baseline for testing more complex schemes.
  • Figure 4: Data eye and bathtub analysis of the Node-FPGA to Aggregator (top) and reverse direction (bottom) multi-gigabit transceiver link using the manufacturer's analysis tool. The tested 8 Gbit/s data rate is the maximum supported by the Node-FPGA. The data eye is shown as an example for one link, whereas the bathtub curves are shown for all links together with the recommended margin. Bit-error rate tests down to $10^{-15}$ were successfully executed, suggesting that a smaller margin would suffice. The shown transceiver configuration was not optimized for power efficiency.
  • Figure 5: (A) Measurement of the latency of $2^{15}$ spikes with a 3:1 fan-in between Node-FPGAs (bottom) and between BSS-2 ASICs (top) for a range of regular rates up to congestion of the receiver. In the worst regime, the total event jitter constitutes roughly 15% of the median delay. The visible discretization of the distributions corresponds to the 8 ns FPGA system clock period used to measure the latency. (B) With the default hardware speed-up of $10^{3}$, the routing latency is one order of magnitude below typical membrane time constants measured in biology [sunkin2012allen]. The speed-up factor can be chosen within certain bounds due to large circuit parameter calibration ranges [billaudelle2022accurate], reducing or increasing the model parameters with respect to the fixed routing latency as indicated.
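
The spike path described in the Figure 3 caption (label lookup with a routing enable on the sending Node-FPGA, all-to-all broadcast with static per-route enables in the Aggregator, reverse lookup and timestamping on the receiving Node-FPGA) can be sketched in software as follows. This is a behavioral illustration only, not the authors' FPGA implementation: the class and function names, the table layouts and the choice to drop spikes without a table entry are all assumptions, and the label widths are taken from the caption as read here.

```python
from __future__ import annotations

from dataclasses import dataclass, field

LINK_LABEL_BITS = 15  # label width on the transceiver link, per the caption
CHIP_LABEL_BITS = 16  # spike label width on the chip, per the caption


@dataclass
class Node:
    """Hypothetical software model of one Node-FPGA routing extension."""

    # chip label -> (link label, routing enable); models the forward lookup
    tx_lut: dict[int, tuple[int, bool]] = field(default_factory=dict)
    # link label -> chip label; models the reverse lookup
    rx_lut: dict[int, int] = field(default_factory=dict)

    def to_link(self, chip_label: int) -> int | None:
        """Forward lookup. Returns None if the spike is not routed."""
        entry = self.tx_lut.get(chip_label)
        if entry is None or not entry[1]:
            return None  # assumption: unmapped or disabled spikes are dropped
        return entry[0] & ((1 << LINK_LABEL_BITS) - 1)

    def to_chip(self, link_label: int, system_time: int) -> tuple[int, int] | None:
        """Reverse lookup, then attach the lower eight bits of system time."""
        chip_label = self.rx_lut.get(link_label)
        if chip_label is None:
            return None  # assumption: unmapped labels are dropped
        return chip_label & ((1 << CHIP_LABEL_BITS) - 1), system_time & 0xFF


def aggregate(source: int, link_label: int,
              route_enable: list[list[bool]]) -> list[tuple[int, int]]:
    """Broadcast a spike to every destination whose static route is enabled."""
    return [(dst, link_label)
            for dst, enabled in enumerate(route_enable[source]) if enabled]


# Usage: three nodes, with a single enabled route from node 0 to node 1.
nodes = [Node() for _ in range(3)]
nodes[0].tx_lut[0x0123] = (0x0042, True)
nodes[1].rx_lut[0x0042] = 0x0456
route_enable = [[False, True, False], [False] * 3, [False] * 3]

link_label = nodes[0].to_link(0x0123)
deliveries = aggregate(0, link_label, route_enable)
received = [nodes[dst].to_chip(label, system_time=0x1234)
            for dst, label in deliveries]
print(deliveries, received)
```

In this model the static enable matrix is the only routing state in the Aggregator, which mirrors the caption's description of the scheme as a simple baseline for testing more complex ones.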