HARP: A Taxonomy for Heterogeneous and Hierarchical Processors for Mixed-reuse Workloads
Raveesh Garg, Michael Pellauer, Tushar Krishna
TL;DR
The paper tackles the challenge of mixed-reuse AI workloads by introducing HARP, a taxonomy that classifies hierarchical and heterogeneous processors (HHPs) along two axes: compute location and heterogeneity. It pairs the taxonomy with a Timeloop-based evaluation framework (including a modified cost model) to study how resource partitioning and mapping affect performance and energy for transformer workloads. Key contributions include the HARP taxonomy, a framework for black-box mapping of sub-accelerators, and a detailed empirical study showing when heterogeneous and hierarchical designs outperform homogeneous ones (notably in decoder-only models) and how bandwidth partitioning influences outcomes. The work provides a structured design space and actionable insights for building energy-efficient accelerators that can hide low-reuse operations behind high-reuse computations.
Abstract
Artificial intelligence (AI) application domains consist of a mix of tensor operations with high and low arithmetic intensities (aka reuse). Hierarchical (i.e., compute placed at multiple levels of the memory hierarchy) and heterogeneous (multiple different sub-accelerators) accelerators are emerging as a popular way to process mixed-reuse workloads and workloads that consist of tensor operators with diverse shapes. However, the space of hierarchical and/or heterogeneous processors (HHPs) is relatively under-explored. Prior works have proposed custom architectures that exploit heterogeneity by providing multiple sub-accelerators, each efficient for different operator shapes. In this work, we propose HARP, a taxonomy to classify various hierarchical and heterogeneous accelerators, and use it to study the impact of heterogeneity at various levels in the architecture. The HARP taxonomy captures the various ways in which HHPs can be conceived, ranging from B100 cores with "intra-node heterogeneity" between the SM and the tensor core to NeuPIM with cross-depth heterogeneity, which occurs at different levels of the memory hierarchy. We use the Timeloop mapper to find the best mapping for sub-accelerators, and we modify the Timeloop cost model to extend it to hierarchical and heterogeneous accelerators.
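To make the notion of high and low arithmetic intensity concrete, the following minimal Python sketch estimates the reuse of a matrix multiplication as operations per byte moved. It is an illustration only and not part of the HARP framework: the function name, the 2-byte element size, the example shapes, and the assumption that each operand is moved exactly once (ideal on-chip reuse) are all hypothetical choices made here.

```python
def gemm_arithmetic_intensity(m, n, k, bytes_per_element=2):
    """Estimate FLOPs per byte for C[m, n] = A[m, k] @ B[k, n].

    Assumes (for illustration) that A, B, and C are each moved once,
    i.e. ideal reuse once the data is on chip.
    """
    flops = 2 * m * n * k  # one multiply and one add per MAC
    bytes_moved = bytes_per_element * (m * k + k * n + m * n)
    return flops / bytes_moved


# Hypothetical shapes: many input rows share one weight matrix (high reuse)
high_reuse = gemm_arithmetic_intensity(m=2048, n=4096, k=4096)

# Hypothetical shapes: a single input row per weight matrix (low reuse)
low_reuse = gemm_arithmetic_intensity(m=1, n=4096, k=4096)

print(f"high-reuse GEMM: {high_reuse:.1f} FLOPs/byte")
print(f"low-reuse GEMM:  {low_reuse:.3f} FLOPs/byte")
```

Under these assumptions the two operators differ in intensity by roughly three orders of magnitude, which is the kind of gap that motivates giving them different sub-accelerators.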
