The Turbo-Charged Mapper: Fast and Optimal Mapping for Accelerator Modeling and Evaluation

Michael Gilbert, Tanner Andrulis, Vivienne Sze, Joel S. Emer

TL;DR

The paper tackles the problem of optimally mapping DNN workloads onto accelerator hardware. It introduces dataplacement, a memory-level tiling concept that, together with dataflow and tile shapes, enables principled pruning of an enormous mapspace (up to $10^{37}$ possibilities). The Turbo-Charged Mapper (TCM) uses a currying-based analytical model to prune suboptimal mappings, then fully explores the remaining space, which guarantees optimal mappings within feasible runtimes (seconds to minutes). Key contributions include formalizing dataplacement; using it to prune redundant dataflows, non-helpful loops, and partial tile shapes; and a four-step process that shrinks the mapspace by up to 32 orders of magnitude and finds mappings with lower energy-delay product (EDP) than prior mappers. The approach enables robust hardware evaluation by ensuring that observed performance differences arise from hardware changes rather than from suboptimal mappings. Speed and EDP gains are demonstrated on GPT-3 and MobileNetV3 workloads running on TPU-V4i-like and NVDLA-like architectures.

Abstract

The energy and latency of an accelerator running a deep neural network (DNN) depend on how the computation and data movement are scheduled in the accelerator (i.e., the mapping). Optimizing mappings is essential to evaluating and designing accelerators. However, the space of mappings is large, and prior works cannot guarantee finding optimal mappings because they use heuristics or metaheuristics to narrow down the space. These limitations preclude proper hardware evaluation, since designers cannot tell whether performance differences are due to changes in hardware or to suboptimal mappings. To address this challenge, we propose the Turbo-Charged Mapper (TCM), a fast mapper that is guaranteed to find optimal mappings. The key to our approach is that we define a new concept in mapping, called dataplacement, which, like the prior concept of dataflow, allows for clear analysis and comparison of mappings. Through it, we identify multiple opportunities to prune redundant and suboptimal mappings, reducing the search space by up to 32 orders of magnitude. Leveraging these insights, TCM can perform full mapspace searches, making it the first mapper that can find optimal mappings in feasible runtime. Compared to prior mappers, we show that TCM can find optimal mappings quickly (less than a minute), while prior works cannot find optimal mappings (energy-delay product $21\%$ higher than optimal) even when given $1000\times$ the runtime ($>10$ hours).
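
The claim that pruning can shrink the search space without losing the optimum can be illustrated with a small sketch. The code below is not TCM and does not use its analytical model. It assumes a toy cost model for a tiled matrix multiply in which each mapping is a tile shape `(tm, tn, tk)`, the buffer footprint is the sum of the three tile sizes, and off-chip traffic depends only on `tm` and `tn`. All function names and the cost formulas are hypothetical, chosen only for illustration. The sketch discards every mapping that is no better than some other mapping in both footprint and traffic, then checks that the best feasible mapping is unchanged.

```python
from itertools import product


def divisors(n):
    """All positive divisors of n, used as candidate tile sizes."""
    return [d for d in range(1, n + 1) if n % d == 0]


def cost(M, N, K, tm, tn, tk):
    """Toy cost model (an assumption, not the paper's model).

    footprint: elements of A, B, and Z tiles held on chip at once.
    traffic:   off-chip element transfers, assuming the Z tile stays
               on chip while the k loop runs.
    """
    footprint = tm * tk + tk * tn + tm * tn
    traffic = M * K * (N // tn) + K * N * (M // tm) + M * N
    return footprint, traffic


def all_mappings(M, N, K):
    """Enumerate every tile shape together with its (footprint, traffic)."""
    for tm, tn, tk in product(divisors(M), divisors(N), divisors(K)):
        yield (tm, tn, tk), cost(M, N, K, tm, tn, tk)


def pareto_prune(mappings):
    """Keep only mappings not dominated in (footprint, traffic)."""
    kept = []
    best_traffic = float("inf")
    # Sorted by footprint, then traffic: a mapping survives only if it
    # has strictly lower traffic than every mapping with a footprint
    # that is no larger.
    for shape, (footprint, traffic) in sorted(mappings, key=lambda m: m[1]):
        if traffic < best_traffic:
            kept.append((shape, (footprint, traffic)))
            best_traffic = traffic
    return kept


def best_under_capacity(mappings, capacity):
    """Lowest-traffic mapping whose footprint fits in the buffer."""
    feasible = [m for m in mappings if m[1][0] <= capacity]
    return min(feasible, key=lambda m: m[1][1]) if feasible else None


full = list(all_mappings(16, 16, 16))
pruned = pareto_prune(full)
print(len(full), "mappings before pruning,", len(pruned), "after")
print("best in full space:  ", best_under_capacity(full, 64))
print("best in pruned space:", best_under_capacity(pruned, 64))
```

Because any dominated mapping has a surviving counterpart with no larger footprint and no higher traffic, the search over the pruned set returns the same optimal traffic for every buffer capacity. In this toy model, larger `tk` values only increase footprint, so most of the discarded mappings differ from a survivor only in `tk`.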

Paper Structure (30 sections, 8 equations, 8 figures, 3 tables)

Figures (8)

  • Figure 1: (a) An example LoopTree, the different types of nodes, and their meaning. (b) An example LoopTree with tiling.
  • Figure 2: Constructing a mapspace for the Einsum in Eq. \ref{eq:mm} by choosing dataplacement, dataflow, then tile shape. The part of the mapping chosen in each step is highlighted in red. In (2) and (3), only some choices are shown; similar choices must be made for parts in ellipses.
  • Figure 3: (a) An example dataplacement and slots where loops may be inserted. (b) An example mapping using the dataplacement, including tile shapes and numbers of tiles fetched. Dataplacement shows the trade-off between tile size and number of tiles fetched: Storage nodes higher in the dataplacement reduce fetches, but can increase tile size.
  • Figure 4: A mapping with non-helpful loops. Notice that the $n_2$ and $n_0$ loops both have costs (more fetches, larger tiles) but no benefits (fewer fetches, smaller tiles). The $n_1$ loop is strictly better than either. Loop pruning will eliminate the $n_2$ and $n_0$ loops while keeping the $n_1$ loop.
  • Figure 5: Overview of TCM. Each mapper process (in blue) is explained in detail in the sections in parentheses.
  • ...and 3 more figures