Communication Characterization of AI Workloads for Large-scale Multi-chiplet Accelerators
Mariam Musavi, Emmanuel Irabor, Abhijit Das, Eduard Alarcon, Sergi Abadal
TL;DR
This work tackles the scalability bottlenecks of AI workloads on multi-chiplet accelerators by profiling data movement, with a focus on inter-chiplet multicast traffic. Using GEMINI-driven mappings and trace extraction, the authors quantify time spent in communication, message counts, and network-on-package (NoP) hops across 12 workloads and multiple chiplet configurations. They find that inter-chiplet NoP traffic often dominates and that multicast traffic significantly impacts performance as the number of chiplets grows, signaling potential bottlenecks in many workloads. The study contributes a multicast characterization methodology and argues for flexible, movement-aware interconnects (including wireless or optical options) to improve performance and energy efficiency in large-scale AI accelerators.
Abstract
Next-generation artificial intelligence (AI) workloads pose challenges of scalability and robustness in terms of execution time because of their intrinsically evolving, data-intensive characteristics. In this paper, we analyse the potential bottlenecks caused by the data movement characteristics of AI workloads on scale-out accelerator architectures composed of multiple chiplets. Our methodology captures the unicast and multicast communication traffic of a set of AI workloads and assesses aspects such as the time spent in such communications and the number of multicast messages as a function of the number of chiplets employed. Our studies reveal that some AI workloads are potentially vulnerable to the dominant effects of communication, especially multicast traffic, which can become a performance bottleneck and limit their scalability. These workload profiling insights suggest architecting a flexible interconnect solution at the chiplet level to improve the performance, efficiency and scalability of next-generation AI accelerators.
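
As a rough illustration of the kind of metric the abstract describes, the sketch below classifies the messages in a communication trace as local, unicast, or multicast and reports the multicast share of inter-chiplet traffic. The trace format (a source chiplet id plus a tuple of destination chiplet ids) and the names `Message` and `summarize_trace` are hypothetical, chosen only for this example; they are not the trace format or tooling used by the authors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Message:
    """One trace record (hypothetical format, assumed for illustration)."""
    src: int      # id of the sending chiplet
    dsts: tuple   # ids of the receiving chiplets


def summarize_trace(trace):
    """Count local, unicast, and multicast messages in a trace.

    A message is 'local' if it never leaves its source chiplet,
    'unicast' if it reaches exactly one remote chiplet, and
    'multicast' if it reaches two or more remote chiplets.
    """
    counts = {"local": 0, "unicast": 0, "multicast": 0}
    for msg in trace:
        # Only destinations on other chiplets cross the package network.
        remote = {d for d in msg.dsts if d != msg.src}
        if not remote:
            counts["local"] += 1
        elif len(remote) == 1:
            counts["unicast"] += 1
        else:
            counts["multicast"] += 1

    inter_chiplet = counts["unicast"] + counts["multicast"]
    # Guard against traces with no inter-chiplet traffic at all.
    counts["multicast_fraction"] = (
        counts["multicast"] / inter_chiplet if inter_chiplet else 0.0
    )
    return counts


if __name__ == "__main__":
    example = [
        Message(src=0, dsts=(1,)),        # unicast
        Message(src=0, dsts=(1, 2, 3)),   # multicast
        Message(src=2, dsts=(2,)),        # local
    ]
    print(summarize_trace(example))
```

Running the same summary on traces generated for different chiplet counts would give the multicast share as a function of system size, which is the relationship the abstract refers to.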
