
DCRA: A Distributed Chiplet-based Reconfigurable Architecture for Irregular Applications

Marcelo Orenes-Vera, Esin Tureci, Margaret Martonosi, David Wentzlaff

TL;DR

The paper tackles the challenge of scaling irregular graph and sparse-data workloads by proposing DCRA, a distributed chiplet-based reconfigurable architecture. DCRA combines composable chiplets, a software-configurable 2D Torus NoC, SRAM usable as cache or scratchpad, and DRAM/HBM interleaving, so that a node can be optimized for different target metrics after fabrication, at packaging time. Through an evaluation of six irregular applications on multiple datasets, the authors quantify tradeoffs across pre-silicon configurations (tile, SRAM, and PU choices), packaging-time choices (HBM vs. SRAM interleaving), and compile-time parameters (queue sizes, grid size), and they provide a detailed cost model. The results demonstrate substantial performance and cost benefits over monolithic designs such as Dalorex and give architects a practical framework for tailoring nodes to specific throughput, energy, or cost targets in irregular workloads. Overall, DCRA enables scalable, configurable hardware for irregular applications by deferring key design decisions from silicon to packaging time and by supporting reconfigurable topologies through its torus network.

Abstract

In recent years, the growing demand to process large graphs and sparse datasets has led to increased research efforts to develop hardware- and software-based architectural solutions to accelerate them. While some of these approaches achieve scalable parallelization with up to thousands of cores, adoption of these proposals by the industry has remained slow. To help solve this dissonance, we identified a set of questions and considerations that current research has not considered deeply. Starting from a tile-based architecture, we put forward a Distributed Chiplet-based Reconfigurable Architecture (DCRA) for irregular applications that carefully considers the fabrication constraints that made prior work either hard or costly to implement or too rigid to be applied. We identify and study pre-silicon, package-time, and compile-time configurations that help optimize DCRA for different deployments and target metrics. To enable that, we propose a practical path for manufacturing chip packages by composing variable numbers of DCRA and memory dies, with a software-configurable Torus network to connect them. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of DCRA as a compute node for scale-out sparse data processing. Finally, we present our findings and discuss how DCRA's framework for design exploration can help guide architects to build scalable and cost-efficient systems for irregular applications.

Paper Structure (21 sections, 12 figures, 3 tables)

Figures (12)

  • Figure 1: Two possible integrations of the same 32x32-tile DCRA die (dimensions depicted with 256KB SRAM/tile). Top: two packages on a board, each featuring only DCRA dies, optimized for time-to-solution as it maximizes parallelization (lower data footprint per die). Bottom: a single package with DCRA dies and stacked DRAM, optimized for performance-per-dollar.
  • Figure 2: Horizontal links within a DCRA die and across dies. The red links show the NoC that connects every tile (tile-NoC), while the blue links show the NoC that connects to one tile per die (die-NoC). Because of the die-NoC, the routers at the die edges are radix-9, while the rest are radix-5. The ports shaded in blue are runtime-reconfigurable; any tile subgrid within a node board may become a torus (including across packages). The dies on the edges of a package interface with the I/O die. All I/O links are configured when loading the dataset to maximize I/O bandwidth. During program execution, either both NoCs become tori, or the tile-NoC becomes a torus while the die-NoC remains a mesh to keep streaming from I/O. Note that since the torus is folded, all the links within each NoC are nearly the same length. Even the longest die-NoC wires are shorter than the 25mm die-to-die limit [bow] for the integrations shown in \ref{fig:cake}.
  • Figure 3: Top view of a package with 128x128 tiles. The number of DCRA dies would determine the compute capacity, while including HBM dies would determine the compute-to-memory ratio of the package. The number of I/O dies (and their bandwidth) determines the off-chip bandwidth.
  • Figure 4: Performance, energy efficiency, and performance per dollar improvements of different network choices over a baseline of a 32-bit 2D-Mesh. All configurations use 64x64 tiles (across 16 chiplets), with 512KB SRAM/tile.
  • Figure 5: Performance, energy efficiency, and performance per dollar improvements of different SRAM sizes and number of tiles per HBM channel, over a baseline of 64KB SRAM and 128 tiles per HBM channel (T/C). A DCRA chiplet is always attached to a single 8-channel HBM device, and thus, the number of tiles per chiplet determines the ratio of tiles per HBM channel.
  • ...and 7 more figures
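
Two remarks in the captions above lend themselves to a quick check: the Figure 2 note that a folded torus keeps all links nearly the same length, and the Figure 5 note that the tiles per chiplet fix the tiles-per-HBM-channel ratio. The Python sketch below is illustrative only. The function names are hypothetical, the interleaved placement is the textbook folded-ring layout (assumed here, not taken from the paper's physical design), and link length is counted in tile slots with unit spacing.

```python
def folded_ring_order(n):
    """Physical left-to-right order of logical ring nodes 0..n-1.

    Assumed textbook folding: nodes are interleaved from both ends
    (0, n-1, 1, n-2, ...), so logical neighbors sit at most two slots apart.
    """
    order = []
    lo, hi = 0, n - 1
    while lo <= hi:
        order.append(lo)
        if lo != hi:
            order.append(hi)
        lo += 1
        hi -= 1
    return order


def max_link_length(order):
    """Longest ring link, in slots, for a given physical placement."""
    n = len(order)
    pos = {node: p for p, node in enumerate(order)}
    return max(abs(pos[i] - pos[(i + 1) % n]) for i in range(n))


def tiles_per_hbm_channel(die_rows, die_cols, hbm_channels=8):
    """Tiles sharing one HBM channel when one die attaches to one HBM device."""
    return (die_rows * die_cols) / hbm_channels


if __name__ == "__main__":
    n = 32  # one row of a 32x32-tile die
    print("unfolded ring, longest link:", max_link_length(list(range(n))))
    print("folded ring, longest link:  ", max_link_length(folded_ring_order(n)))
    print("tiles per HBM channel (32x32 die):", tiles_per_hbm_channel(32, 32))
```

Under these assumptions, the wraparound link of an unfolded 32-node ring spans 31 slots, while the folded placement caps every link at 2 slots. A 32x32-tile die attached to one 8-channel HBM device gives 1024 / 8 = 128 tiles per channel, which matches the baseline ratio quoted in the Figure 5 caption.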