Communication Characterization of AI Workloads for Large-scale Multi-chiplet Accelerators

Mariam Musavi, Emmanuel Irabor, Abhijit Das, Eduard Alarcon, Sergi Abadal

TL;DR

This work tackles the scalability bottlenecks of AI workloads on multi-chiplet accelerators by profiling data movement, with a focus on inter-chiplet multicast traffic. Using GEMINI-driven mappings and trace extraction, the authors quantify time spent in communication, message counts, and NoP hops across 12 workloads and multiple chiplet configurations. They find that inter-chiplet NoP traffic often dominates and multicast traffic significantly impacts performance as chiplets scale, signaling potential bottlenecks in many workloads. The study contributes a multicast characterization methodology and argues for flexible, movement-aware interconnects (including wireless or optical options) to improve performance and energy efficiency in large-scale AI accelerators.

Abstract

Next-generation artificial intelligence (AI) workloads pose challenges of scalability and robustness in terms of execution time because of their evolving, data-intensive characteristics. In this paper, we aim to analyse the potential bottlenecks caused by the data movement characteristics of AI workloads on scale-out accelerator architectures composed of multiple chiplets. Our methodology captures the unicast and multicast communication traffic of a set of AI workloads and assesses aspects such as the time spent in such communications and the number of multicast messages as a function of the number of employed chiplets. Our studies reveal that some AI workloads are potentially vulnerable to the dominant effects of communication, especially multicast traffic, which can become a performance bottleneck and limit their scalability. Workload profiling insights suggest architecting a flexible interconnect solution at the chiplet level in order to improve the performance, efficiency and scalability of next-generation AI accelerators.
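
The kind of trace-level accounting the abstract describes (separating unicast from multicast messages and relating traffic to the chiplet array) can be illustrated with a minimal sketch. Everything in it is an assumption made for illustration and is not taken from the paper or from GEMINI: the trace format (a source chiplet id plus a list of destination ids), the row-major 2D mesh numbering, the Manhattan-distance hop metric, and the function names `classify`, `mesh_hops` and `multicast_fraction`.

```python
from collections import Counter

def classify(trace):
    """Count unicast vs multicast messages.

    Assumed (hypothetical) trace format: each message is (src, [dst, ...]).
    A message with more than one destination is counted as multicast.
    """
    counts = Counter()
    for _src, dsts in trace:
        counts["multicast" if len(dsts) > 1 else "unicast"] += 1
    return counts

def mesh_hops(src, dst, cols):
    """Manhattan hop distance between two chiplet ids on a row-major 2D mesh.

    This is an illustrative metric, not necessarily the one used in the paper.
    """
    src_row, src_col = divmod(src, cols)
    dst_row, dst_col = divmod(dst, cols)
    return abs(src_row - dst_row) + abs(src_col - dst_col)

def multicast_fraction(trace):
    """Share of messages in the trace that are multicast (0.0 for an empty trace)."""
    counts = classify(trace)
    total = sum(counts.values())
    return counts["multicast"] / total if total else 0.0

# Toy trace on a 3x3 chiplet array (ids 0..8); the values are made up.
trace = [(0, [1]), (0, [1, 2, 8]), (4, [0, 8]), (3, [5])]
counts = classify(trace)
fraction = multicast_fraction(trace)
corner_to_corner = mesh_hops(0, 8, cols=3)
```

Repeating this accounting for each chiplet array size would give the multicast message count as a function of the number of chiplets, which is the quantity the abstract refers to.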

Paper Structure

This paper contains 11 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: An illustration of multi-chiplet architecture with 3x3 computing chiplets and 4 DRAM chiplets.
  • Figure 2: Methodology for characterizing data movement of AI workloads in multi-chip architectures, based on GEMINI [cai2024gemini].
  • Figure 3: Fraction of overall execution time (in clock cycles) spent by on-chip, chip-to-chip and chip-to-DRAM data movement related tasks for each AI workload across configurations.
  • Figure 4: Fraction of data movement time spent in the DRAM, NoP and NoC across all AI workloads and chiplet array configurations. The total time in clock cycles is shown in red at the bottom of the plot.
  • Figure 5: Fraction of time spent by NoP over the total communication time across all chiplet array configurations.
  • ...and 2 more figures