Chiplet-Gym: Optimizing Chiplet-based AI Accelerator Design with Reinforcement Learning

Kaniz Mishty, Mehdi Sadi

TL;DR

This work tackles Power, Performance, Area, and manufacturing Cost (PPAC) optimization for chiplet-based AI accelerators with Chiplet-Gym, a co-design framework that embeds an analytical PPAC model in an OpenAI Gym environment and searches for design points with reinforcement learning (PPO) and simulated annealing (SA). The design space spans 2.5D and 5.5D packaging, chiplet allocation, and placement, and the approach is evaluated on MLPerf benchmarks. At iso-area, a 3D-stacked chiplet configuration delivers up to $1.52\times$ the throughput at $0.27\times$ the energy and $0.01\times$ the die cost of its monolithic counterpart, while package cost rises to about $1.62\times$. The methodology combines physical and economic models (yield, inter-chiplet latency, bandwidth, energy, and packaging cost) with multiple RL seeds and SA runs, which makes the optimizer more robust and less likely to settle on a poor local optimum. The results indicate that co-designing the chiplet architecture and its packaging can yield substantial performance and energy-efficiency gains while reducing die manufacturing cost.

Abstract

Modern Artificial Intelligence (AI) workloads demand computing systems with large silicon area to sustain throughput and competitive performance. However, prohibitive manufacturing costs and yield limitations at advanced technology nodes, together with die sizes approaching the reticle limit, make such large dies impractical. With recent innovations in advanced packaging technologies, chiplet-based architectures have gained significant attention in the AI hardware domain. However, the vast design space of chiplet-based AI accelerators and the absence of a system- and package-level co-design methodology make it difficult for the designer to find the optimum design point with respect to Power, Performance, Area, and manufacturing Cost (PPAC). This paper presents Chiplet-Gym, a Reinforcement Learning (RL)-based optimization framework that explores the design space of chiplet-based AI accelerators, encompassing resource allocation, placement, and packaging architecture. We analytically model the PPAC of the chiplet-based AI accelerator and integrate the model into an OpenAI Gym environment to evaluate design points. We also explore non-RL-based optimization approaches and combine the two approaches to ensure the robustness of the optimizer. At iso-area, the optimizer-suggested design point achieves 1.52X the throughput, 0.27X the energy, and 0.01X the die cost of its monolithic counterpart, while incurring only 1.62X the package cost.
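
The environment integration described in the abstract can be pictured with a small, self-contained sketch. It mirrors the classic Gym `reset`/`step` shape without importing the library, so it runs as plain Python. The class name `DesignPointEnv`, the two design knobs, and the `toy_score` function are all hypothetical stand-ins rather than the paper's PPAC model, and exhaustive enumeration stands in for the PPO agent and simulated annealing only because the toy space is tiny.

```python
import itertools
import random


class DesignPointEnv:
    """Gym-style wrapper: an action picks a design point, the reward is its score."""

    def __init__(self, knobs, score_fn, seed=0):
        self.knobs = knobs            # list of option lists, one per design knob
        self.score_fn = score_fn      # user-supplied evaluation model
        self.rng = random.Random(seed)
        self.state = None

    def reset(self):
        # Start each episode from a random design point (tuple of option indices).
        self.state = tuple(self.rng.randrange(len(opts)) for opts in self.knobs)
        return self.state

    def step(self, action):
        # Decode the index tuple into concrete options and score that design.
        design = tuple(opts[i] for opts, i in zip(self.knobs, action))
        reward = self.score_fn(design)
        self.state = tuple(action)
        done = True                   # one design evaluation per episode here
        return self.state, reward, done, {"design": design}


def toy_score(design):
    # Placeholder trade-off with made-up coefficients (NOT the paper's model).
    num_chiplets, packaging = design
    throughput = num_chiplets ** 0.8
    die_cost = 1.0 / num_chiplets
    package_cost = 0.05 * num_chiplets * (2.0 if packaging == "3D" else 1.0)
    return throughput - die_cost - package_cost


def best_by_enumeration(env):
    # Try every index combination; a stand-in for an RL or SA optimizer.
    best_reward, best_design = float("-inf"), None
    for action in itertools.product(*(range(len(o)) for o in env.knobs)):
        env.reset()
        _, reward, _, info = env.step(action)
        if reward > best_reward:
            best_reward, best_design = reward, info["design"]
    return best_design, best_reward


env = DesignPointEnv([[1, 2, 4, 8, 16], ["2.5D", "3D"]], toy_score)
design, reward = best_by_enumeration(env)
print(design, round(reward, 3))
```

Passing the scoring function in as an argument keeps the environment independent of any particular cost model, so a real analytical model could replace `toy_score` without touching the `reset`/`step` logic.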

Paper Structure

This paper contains 40 sections, 17 equations, 12 figures, 7 tables, 2 algorithms.

Figures (12)

  • Figure 1: AI accelerator chiplet architecture
  • Figure 2: Top-level system architecture for different scenarios. (a) CPU, AI accelerator, and HBM chiplets are connected at the package level through 2.5D interconnects; CoWoS and EMIB are two 2.5D interconnect options. (b) CPU and AI accelerator chiplets are connected through 2.5D interconnects, and HBM is stacked on top of the CPU and AI accelerator through 3D interconnects. (c) Two AI accelerator chiplets are stacked on top of each other through 3D interconnects and are connected to the CPU, HBM, and other AI chiplet pairs through 2.5D interconnects.
  • Figure 3: (a) Yield (left y-axis) and normalized cost per yielded area (right y-axis) vs area at different tech nodes. (b) Normalized latency vs number of chiplets.
  • Figure 4: Illustration of latency calculation in hops. (a) AI-to-AI chiplet communication, with the farthest chiplets as the source-destination pair. (b) One HBM chiplet, placed at the left and connected in 2.5D, and the farthest AI chiplet as the source-destination pair. (c) One HBM chiplet, 3D-stacked on top of the left-most AI chiplet, and the farthest AI chiplet as the source-destination pair. (d) Five HBM chiplets placed at five different positions. The highest latency drops from 6 hops (case (c)) to 3 hops, and most AI chiplets can be supplied with data within 2 hops by their nearest HBM.
  • Figure 5: Illustration of mapping and dataflow. (a) Splitting the matrices into smaller parts for different chiplets. (b) Initial data supply from DRAM; once the chiplets are loaded with the required data, computation begins. (c) Final output collection to DRAM. In this dataflow, no inter-chiplet communication of partial sums occurs during computation.
  • ...and 7 more figures
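
The hop-based latency estimate illustrated in Figure 4 can be sketched as follows, assuming the chiplets sit on a regular 2D mesh, each link costs one hop, and a 3D-stacked HBM adds no extra hop to the tile beneath it. The function names, the 4x4 grid, and the HBM positions are hypothetical and are not taken from the figure, so the numbers the sketch prints should not be read as the paper's results.

```python
from itertools import product


def manhattan_hops(a, b):
    # Hop count between two mesh tiles given as (row, col), one hop per link.
    return abs(a[0] - b[0]) + abs(a[1] - b[1])


def worst_case_ai_to_ai(rows, cols):
    # The farthest source-destination pair on a mesh is two opposite corners.
    return manhattan_hops((0, 0), (rows - 1, cols - 1))


def worst_case_hbm_to_ai(rows, cols, hbm_sites):
    # Each AI tile is served by its nearest HBM; report the worst such distance.
    return max(
        min(manhattan_hops(tile, hbm) for hbm in hbm_sites)
        for tile in product(range(rows), range(cols))
    )


# Hypothetical 4x4 mesh: one corner HBM versus five spread-out HBMs.
single_hbm = [(0, 0)]
five_hbms = [(0, 0), (0, 3), (3, 0), (3, 3), (1, 1)]
print(worst_case_ai_to_ai(4, 4))
print(worst_case_hbm_to_ai(4, 4, single_hbm))
print(worst_case_hbm_to_ai(4, 4, five_hbms))
```

Taking the maximum over tiles of the minimum over HBM sites captures the point of case (d): adding memory sites lowers the worst-case distance because every tile only needs to reach its nearest HBM.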