The Turbo-Charged Mapper: Fast and Optimal Mapping for Accelerator Modeling and Evaluation
Michael Gilbert, Tanner Andrulis, Vivienne Sze, Joel S. Emer
TL;DR
The paper tackles the problem of optimally mapping DNN workloads onto accelerator hardware. It introduces dataplacement, a memory-level tiling concept that, together with dataflow and tile shapes, enables principled pruning of an enormous mapspace (up to $10^{37}$ mappings). The Turbo-Charged Mapper (TCM) uses a currying-based analytical model to prune suboptimal mappings and then fully explores the remaining space, which guarantees optimal mappings within feasible runtimes (seconds to minutes). Key contributions include formalizing dataplacement; enabling three kinds of pruning (redundant-dataflow, non-helpful-loop, and partial-tile-shape pruning); and delivering a four-step process that shrinks the mapspace by up to $32$ orders of magnitude and yields significant energy-delay-product (EDP) improvements over prior mappers. Because the mappings are optimal, observed performance differences can be attributed to hardware changes rather than to suboptimal mappings. Speed and EDP gains are demonstrated on GPT-3 and MobileNetV3 workloads on TPU-V4i-like and NVDLA-like architectures.
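The idea that pruning provably equivalent candidates leaves the optimum of an exhaustive search unchanged can be illustrated with a small sketch. This is a toy, not TCM's model or its pruning rules: the matrix-multiply workload, the single buffer, the traffic formula, and every name below (`DIMS`, `BUFFER_WORDS`, `live_loops`, `traffic`, `search`) are assumptions made for illustration. The only pruning shown is skipping loop orders that differ solely in where single-iteration loops sit, since in this toy model such loops cannot change data reuse.

```python
from itertools import permutations, product

# Hypothetical toy workload: C[M, N] += A[M, K] * B[K, N]
DIMS = {"M": 8, "K": 8, "N": 8}
RELEVANT = {"A": {"M", "K"}, "B": {"K", "N"}, "C": {"M", "N"}}
BUFFER_WORDS = 64  # assumed on-chip buffer capacity, in words


def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]


def live_loops(order, tile):
    """Keep only loops with more than one iteration (outermost first)."""
    return tuple(d for d in order if DIMS[d] // tile[d] > 1)


def traffic(loops, tile):
    """Toy off-chip traffic: each tensor is re-fetched once per iteration of
    every irrelevant loop that sits outside its innermost relevant loop."""
    total = 0
    for dims in RELEVANT.values():
        size = 1
        for d in dims:
            size *= DIMS[d]
        innermost = max((i for i, d in enumerate(loops) if d in dims), default=-1)
        reloads = 1
        for i, d in enumerate(loops):
            if d not in dims and i < innermost:
                reloads *= DIMS[d] // tile[d]
        total += size * reloads
    return total


def search(prune):
    """Exhaustively search tile shapes and loop orders.
    Returns (best traffic, number of candidates evaluated)."""
    best, evaluated, seen = None, 0, set()
    sizes = [divisors(DIMS[d]) for d in ("M", "K", "N")]
    for m, k, n in product(*sizes):
        if m * k + k * n + m * n > BUFFER_WORDS:
            continue  # the three tiles do not fit in the buffer
        tile = {"M": m, "K": k, "N": n}
        for order in permutations(DIMS):
            loops = live_loops(order, tile)
            key = (m, k, n, loops)
            if prune and key in seen:
                continue  # equivalent to a candidate already evaluated
            seen.add(key)
            evaluated += 1
            cost = traffic(loops, tile)
            if best is None or cost < best:
                best = cost
    return best, evaluated


full_best, full_count = search(prune=False)
pruned_best, pruned_count = search(prune=True)
print("full search:  ", full_best, "in", full_count, "evaluations")
print("pruned search:", pruned_best, "in", pruned_count, "evaluations")
```

Both searches cover the whole (toy) mapspace, so they report the same best traffic; the pruned one simply evaluates fewer candidates. The reductions reported in the paper come from its own pruning techniques applied to far larger mapspaces.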
Abstract
The energy and latency of an accelerator running a deep neural network (DNN) depend on how the computation and data movement are scheduled in the accelerator (i.e., the mapping). Optimizing mappings is essential to evaluating and designing accelerators. However, the space of mappings is large, and prior works cannot guarantee finding optimal mappings because they use heuristics or metaheuristics to narrow down the space. These limitations preclude proper hardware evaluation, since designers cannot tell whether performance differences are due to changes in hardware or to suboptimal mappings. To address this challenge, we propose the Turbo-Charged Mapper (TCM), a fast mapper that is guaranteed to find optimal mappings. The key to our approach is a new concept in mapping, called dataplacement, which, like the prior concept of dataflow, allows for clear analysis and comparison of mappings. Through it, we identify multiple opportunities to prune redundant and suboptimal mappings, reducing the search space by up to 32 orders of magnitude. Leveraging these insights, TCM can perform full mapspace searches, making it the first mapper that can find optimal mappings in feasible runtime. Compared to prior mappers, we show that TCM finds optimal mappings quickly (in less than a minute), while prior works cannot find optimal mappings (their energy-delay product is $21\%$ higher than optimal) even when given $1000\times$ the runtime ($>10$ hours).
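To make the energy-delay-product comparison concrete, the sketch below computes EDP as energy times delay and reports how far a mapping sits above the optimum. The function names and all numeric values are hypothetical and chosen only for illustration; they are not measurements from the paper.

```python
def edp(energy_joules, delay_seconds):
    """Energy-delay product: lower is better."""
    return energy_joules * delay_seconds


def excess_over_optimal(candidate_edp, optimal_edp):
    """Fractional amount by which a mapping's EDP exceeds the optimum."""
    return candidate_edp / optimal_edp - 1.0


# Hypothetical numbers: a mapping that is 10% worse in both energy and delay.
optimal = edp(energy_joules=1.0, delay_seconds=1.0)
heuristic = edp(energy_joules=1.1, delay_seconds=1.1)

gap = excess_over_optimal(heuristic, optimal)
print(f"EDP gap over optimal: {gap:.0%}")
```

Because EDP multiplies the two metrics, modest losses in each compound: in this made-up case, 10% more energy and 10% more delay together give an EDP about 21% above optimal.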
