A Lightweight High-Throughput Collective-Capable NoC for Large-Scale ML Accelerators

Luca Colagrande, Lorenzo Leone, Chen Wu, Tim Fischer, Raphael Roth, Luca Benini

Abstract

The exponential increase in Machine Learning (ML) model size and complexity has driven unprecedented demand for high-performance acceleration systems. As technology scaling enables the integration of thousands of computing elements onto a single die, the boundary between distributed and on-chip systems has blurred, making efficient on-chip collective communication increasingly critical. In this work, we present a lightweight, collective-capable Network on Chip (NoC) that supports efficient barrier synchronization alongside scalable, high-bandwidth multicast and reduction operations, co-designed for the next generation of ML accelerators. We introduce Direct Compute Access (DCA), a novel paradigm that grants the interconnect fabric direct access to the cores' computational resources, enabling high-throughput in-network reductions with a small 16.5% router area overhead. Through in-network hardware acceleration, we achieve 2.9x and 2.5x geomean speedups on multicast and reduction operations involving between 1 and 32 KiB of data, respectively. Furthermore, by keeping communication off the critical path in GEMM workloads, these features allow our architecture to scale efficiently to large meshes, resulting in up to 3.8x and 2.4x estimated performance gains through multicast and reduction support, respectively, compared to a baseline unicast NoC architecture, and up to 1.17x estimated energy savings.

Figures (16)

  • Figure 1: (a) Overview of the $5\times4$ collective-capable system. (b) Cluster tile and its main components: (c) compute cluster, (d) network interface and (e) router with collective extensions. (f) Centralized reduction controller enabling arithmetic in-network computation. Highlighted in orange are all modules affected (partially highlighted) or introduced (fully highlighted) by our extensions.
  • Figure 2: (a) Area breakdown of the router for different hardware configurations. Percentages indicate the area overhead with respect to the baseline. (b) Runtime of the software and hardware barriers.
  • Figure 3: Placed-and-routed implementation of the cluster tile, with the network interface, the router and the L1 interconnect highlighted. The remaining area is occupied by the Snitch cores, the L1 memory, the I$ subsystem and the cluster peripherals, which are not highlighted for clarity.
  • Figure 4: Three software multicast implementations: (a) naive sequential, (b) pipelined sequential, (c) tree-based. Each block represents a transfer: the containing row represents the initiator and the label indicates source and destination (source $\rightarrow$ destination). Red lines represent barriers.
  • Figure 5: Runtime (in cycles) of: (a) a 1D multicast transfer; (b) the implementation for various settings of $\alpha_i + \delta, \forall\ i > 0$, labeled next to each curve; (c) a 2D multicast transfer.
  • ...and 11 more figures
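
To make the software baselines of Figure 4 concrete, the following is a minimal C sketch of the tree-based multicast in Figure 4(c). It assumes hypothetical runtime primitives, cluster_id(), dma_copy_to_cluster() and global_barrier(), and a buffer located at the same local L1 address in every cluster; these names are illustrative and are not taken from the paper. In each step, every cluster that already holds the data forwards it to one cluster that does not, so the number of senders doubles per round and the multicast completes in a logarithmic number of steps, each terminated by a barrier (the red lines in Figure 4).

    /* Sketch of the tree-based software multicast of Figure 4(c).
     * cluster_id(), dma_copy_to_cluster() and global_barrier() are
     * hypothetical primitives, not part of the paper's published API.
     * The buffer is assumed to live at the same local address in every
     * cluster's L1 memory. */
    #include <stdint.h>

    extern uint32_t cluster_id(void);                /* index of the calling cluster  */
    extern void dma_copy_to_cluster(uint32_t dst,
                                    const void *src,
                                    uint32_t size);  /* DMA write into a remote L1    */
    extern void global_barrier(void);                /* synchronize all clusters      */

    void tree_multicast(const void *buf, uint32_t size, uint32_t num_clusters) {
        uint32_t id = cluster_id();
        for (uint32_t stride = 1; stride < num_clusters; stride <<= 1) {
            /* Clusters that already hold the data forward it one stride ahead:
             * step 1: 0 -> 1; step 2: 0 -> 2, 1 -> 3; step 3: 0..3 -> 4..7; ... */
            if (id < stride && id + stride < num_clusters) {
                dma_copy_to_cluster(id + stride, buf, size);
            }
            /* Wait until this round of transfers has landed before the
             * receivers become senders in the next round. */
            global_barrier();
        }
    }

Compared with the naive sequential variant of Figure 4(a), where a single initiator issues all transfers back to back, this scheme trades extra barriers for a doubling of concurrent senders in every round.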