A Multicast-Capable AXI Crossbar for Many-core Machine Learning Accelerators
Luca Colagrande, Luca Benini
TL;DR
This work tackles memory and interconnect bottlenecks in massively parallel ML accelerators by introducing a multicast-capable AXI crossbar. The authors implement a mask-based multi-address encoding that extends AXI with backward-compatible multicast support, integrate it into the open-source Occamy accelerator, and evaluate area, timing, and performance. Results show a modest area overhead (roughly 9–12%) and a small frequency impact (about 6% timing overhead on the largest 16-to-16 crossbar), with substantial performance gains on key kernels: microbenchmarks reach speedups of up to about 16×, and matrix multiplication gains reach about 3.4× in GFLOPS (≈29% kernel improvement in the reference system). Overall, the work shows that low-overhead multicast is a practical way to improve on-chip bandwidth utilization in many-core accelerators. The design is released as open source and remains compatible with standard AXI IPs.
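As a rough illustration of what a mask-based multi-address encoding can look like, the Python sketch below assumes that a multicast request carries an address plus a same-width mask, where mask bits set to 1 mark don't-care address bits. This convention, the function names, and the example addresses are assumptions made for illustration; the summary does not specify the encoding details used in the crossbar.

```python
from itertools import product


def expand_multicast(addr: int, mask: int, width: int = 32) -> list[int]:
    """Enumerate every address selected by an (addr, mask) pair.

    Assumed convention (hypothetical): a mask bit of 1 means the
    corresponding address bit is a don't-care.
    """
    free_bits = [b for b in range(width) if (mask >> b) & 1]
    # Clear the don't-care bits to obtain the common base address.
    base = addr & ~mask & ((1 << width) - 1)
    targets = []
    for combo in product((0, 1), repeat=len(free_bits)):
        target = base
        for bit, value in zip(free_bits, combo):
            target |= value << bit
        targets.append(target)
    return sorted(targets)


def matches(target: int, addr: int, mask: int) -> bool:
    """Check whether one destination address is covered by (addr, mask)."""
    return (target & ~mask) == (addr & ~mask)


if __name__ == "__main__":
    # Two don't-care bits (10 and 11) select four destinations.
    for a in expand_multicast(0x1000, 0x0C00, width=16):
        print(hex(a))
```

With a mask of zero the pair selects exactly one address, which is how a scheme of this kind can stay compatible with ordinary unicast transactions. A mask with n set bits always selects 2^n destinations, so arbitrary destination sets would need more than one request under this assumed convention.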
Abstract
To keep up with the growing computational requirements of machine learning workloads, many-core accelerators integrate an ever-increasing number of processing elements, putting the efficiency of memory and interconnect subsystems to the test. In this work, we present the design of a multicast-capable AXI crossbar, with the goal of enhancing data movement efficiency in massively parallel machine learning accelerators. We propose a lightweight, yet flexible, multicast implementation, with a modest area and timing overhead (12% and 6% respectively) even on the largest physically-implementable 16-to-16 AXI crossbar. To demonstrate the flexibility and end-to-end benefits of our design, we integrate our extension into an open-source 288-core accelerator. We report tangible performance improvements on a key computational kernel for machine learning workloads, matrix multiplication, measuring a 29% speedup on our reference system.
