A Multicast-Capable AXI Crossbar for Many-core Machine Learning Accelerators

Luca Colagrande, Luca Benini

TL;DR

This work tackles memory and interconnect bottlenecks in massively parallel ML accelerators by introducing a multicast-capable AXI crossbar. The authors implement a mask-based multi-address encoding that extends AXI with backward-compatible multicast support, integrate it into the open-source Occamy accelerator, and evaluate area, timing, and performance. Results show a modest area overhead (roughly 9–12%) and a small frequency impact, alongside substantial performance gains on key kernels: microbenchmarks see speedups of up to ~16×, and matrix multiplication reaches ~3.4× in GFLOPS (a ≈29% kernel speedup on the reference system). Overall, the work shows that low-overhead multicast is a practical way to improve on-chip bandwidth utilization in many-core accelerators. The design is released as open source and can be adopted alongside standard IPs.

Abstract

To keep up with the growing computational requirements of machine learning workloads, many-core accelerators integrate an ever-increasing number of processing elements, putting the efficiency of memory and interconnect subsystems to the test. In this work, we present the design of a multicast-capable AXI crossbar, with the goal of enhancing data movement efficiency in massively parallel machine learning accelerators. We propose a lightweight, yet flexible, multicast implementation, with a modest area and timing overhead (12% and 6% respectively) even on the largest physically-implementable 16-to-16 AXI crossbar. To demonstrate the flexibility and end-to-end benefits of our design, we integrate our extension into an open-source 288-core accelerator. We report tangible performance improvements on a key computational kernel for machine learning workloads, matrix multiplication, measuring a 29% speedup on our reference system.
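
As an illustration of the mask-based multi-address encoding mentioned in the TL;DR, the following Python sketch expands an (address, mask) pair into the set of addresses it denotes. It assumes that a set mask bit marks the corresponding address bit as "don't care", so that both values of that bit are targeted. This bit polarity and the names expand_multicast and matches are hypothetical, chosen for illustration; they are not taken from the paper's hardware description.

```python
def expand_multicast(addr: int, mask: int) -> list[int]:
    """Return all addresses denoted by (addr, mask), in ascending order.

    Assumption: a 1 in `mask` means the matching bit of `addr` is a
    "don't care" bit, so the address set forks on that bit.
    """
    base = addr & ~mask  # clear every don't-care bit
    addresses = []
    sub = 0  # current combination of don't-care bits
    while True:
        addresses.append(base | sub)
        if sub == mask:
            break
        # Step to the next subset of the mask bits, in ascending order.
        sub = (sub - mask) & mask
    return addresses


def matches(addr: int, mask: int, target: int) -> bool:
    """Check whether `target` belongs to the set denoted by (addr, mask)."""
    # Only bits outside the mask have to agree with the original address.
    return ((target ^ addr) & ~mask) == 0


# Masking the low bits yields a contiguous set of addresses.
contiguous = expand_multicast(0x1000, 0x3)
# Masking higher bits yields a strided set of addresses.
strided = expand_multicast(0x1000, 0xC00)

print([hex(a) for a in contiguous])
print([hex(a) for a in strided])
```

With k mask bits set, a single (address, mask) pair denotes 2^k destinations, and a zero mask degenerates to an ordinary unicast address, which is consistent with the backward compatibility described above.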

Paper Structure

This paper contains 8 sections, 3 figures.

Figures (3)

  • Figure 1: Examples of contiguous (left) and strided (right) address sets representable with our encoding, as paths in the binary number tree. Mask bits selectively fork the path of the original address (blue).
  • Figure 2: (a) Block diagram of a 4-to-4 AXI crossbar, (b) AXI mux submodule (unicast datapath is highlighted in blue, multicast datapath in green, and the logic arbitrating the two in orange), (c) Occamy SoC and (d) AXI demux submodule (multicast stall logic is highlighted in orange, logic controlling AW channel forking in blue, and B channel joining in green); (e) Scenario creating the deadlock condition.
  • Figure 3: (a) Area of the original and multicast-capable crossbars (numbers on top of the bars report the area increase); (b) Speedup on the microbenchmark with our extensions (numbers on top of the bars report the equivalent parallel fraction according to Amdahl's law for the 32 KiB data points); (c) Performance of the matmul kernel; (d) Parallelization and scheduling of the matmul kernel.
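
The "equivalent parallel fraction" in the Figure 3(b) caption can be read through Amdahl's law, S = 1 / ((1 - p) + p / n), solved for p. The sketch below is one plausible reading, not the authors' stated procedure: it assumes n is the number of units the work is spread across, a value this listing does not give, and the function names are hypothetical.

```python
def amdahl_speedup(p: float, n: int) -> float:
    """Amdahl's law: speedup when a fraction p of the work runs on n units."""
    return 1.0 / ((1.0 - p) + p / n)


def equivalent_parallel_fraction(speedup: float, n: int) -> float:
    """Invert Amdahl's law: the fraction p that would yield `speedup` on n units.

    From 1/S = 1 - p * (1 - 1/n) it follows that p = (1 - 1/S) / (1 - 1/n).
    Assumption: n is the number of parallel units, which must exceed 1.
    """
    if n <= 1:
        raise ValueError("n must be greater than 1")
    if not 1.0 <= speedup <= n:
        raise ValueError("speedup must lie between 1 and n")
    return (1.0 - 1.0 / speedup) / (1.0 - 1.0 / n)


# Round trip with illustrative values (not measurements from the paper).
p = 0.9
n = 32
s = amdahl_speedup(p, n)
print(s, equivalent_parallel_fraction(s, n))
```

A speedup equal to n corresponds to p = 1, while a speedup of 1 corresponds to p = 0, so the fraction gives a scale-independent way to compare data points.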