DCRA: A Distributed Chiplet-based Reconfigurable Architecture for Irregular Applications
Marcelo Orenes-Vera, Esin Tureci, Margaret Martonosi, David Wentzlaff
TL;DR
The paper tackles the challenge of scaling irregular graph and sparse-data workloads by proposing DCRA, a distributed chiplet-based reconfigurable architecture. DCRA combines composable chiplets, a software-configurable 2D Torus NoC, SRAM usable as cache or scratchpad, and DRAM/HBM interleaving, so that a node can be optimized for different target metrics at packaging time, after the silicon has been fabricated. Through an evaluation on six irregular applications and four datasets, the authors quantify tradeoffs across pre-silicon configurations (tile/SRAM/PU choices), packaging-time configurations (HBM vs SRAM interleaving), and compile-time parameters (queue sizes, grid size), and they provide a detailed cost model. The results demonstrate substantial performance and cost benefits over monolithic designs such as Dalorex and give architects a practical framework for tailoring nodes to specific throughput, energy, or cost targets in irregular workloads. Overall, DCRA enables scalable, configurable hardware for irregular applications by deferring key design decisions from fabrication to packaging time and by supporting dynamic topologies via a reconfigurable torus network.
Abstract
In recent years, the growing demand to process large graphs and sparse datasets has led to increased research efforts to develop hardware- and software-based architectural solutions to accelerate them. While some of these approaches achieve scalable parallelization with up to thousands of cores, adoption of these proposals by industry has remained slow. To help resolve this dissonance, we identify a set of questions and considerations that current research has not examined in depth. Starting from a tile-based architecture, we put forward a Distributed Chiplet-based Reconfigurable Architecture (DCRA) for irregular applications that carefully considers the fabrication constraints that made prior work either hard or costly to implement, or too rigid to be applied. We identify and study pre-silicon, package-time, and compile-time configurations that help optimize DCRA for different deployments and target metrics. To enable that, we propose a practical path for manufacturing chip packages by composing variable numbers of DCRA and memory dies, with a software-configurable Torus network to connect them. We evaluate six applications and four datasets, with several configurations and memory technologies, to provide a detailed analysis of the performance, power, and cost of DCRA as a compute node for scale-out sparse data processing. Finally, we present our findings and discuss how DCRA's framework for design exploration can help guide architects to build scalable and cost-efficient systems for irregular applications.
