Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators

Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann

TL;DR

This work tackles the problem of accurately estimating DNN latency on edge accelerators across a wide range of architectural variants without resorting to slow RTL simulations. It introduces ACADL, an object-oriented language for modeling accelerators at multiple abstraction levels, and the Architectural Instruction Dependency Graph (AIDG), which captures instruction-level dependencies and resource conflicts for fast latency estimation. By mapping DNNs via TVM onto ACADL models and evaluating the resulting AIDG, the approach outperforms regression and analytical models in accuracy while running orders of magnitude faster than RTL simulation: in the best case, as few as $154$ loop-kernel iterations suffice to estimate the end-to-end latency of $4.19\times10^{9}$ instructions. The method is demonstrated on four accelerator families (UltraTrail, Gemmini, a parameterizable systolic array, and Plasticine-derived designs) and multiple DNN workloads, enabling rapid hardware-aware NAS and design-space exploration with practical memory footprints. The results suggest that the ACADL/AIDG framework can significantly accelerate hardware/software co-design for edge AI by providing precise, scalable latency estimates early in the design process, with potential for integration into automated NAS loops and RTL-backed validation when needed. An evaluation granularity of $0.0001\%$ and $4.19\times10^{9}$ instructions are handled efficiently, illustrating the approach's applicability to large-scale DNN workloads.

Abstract

Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an automated generation approach for fast performance models to accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators such as Gemmini, UltraTrail, Plasticine-derived designs, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions, achieving a significant speedup. We outperform regression and analytical models in terms of mean absolute percentage error (MAPE) compared to simulation results, while being several orders of magnitude faster than an RTL simulation.
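
The accuracy metric named in the abstract, MAPE, can be made concrete with a short sketch. The code below uses the standard textbook definition (the mean of |reference - estimate| / |reference|, expressed in percent); the function name `mape` and the sample cycle counts are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mape(estimated: Sequence[float], reference: Sequence[float]) -> float:
    """Mean absolute percentage error of `estimated` against `reference`, in percent.

    Standard definition; `reference` would be the simulation results and
    `estimated` the performance-model output. Both names are illustrative.
    """
    if len(estimated) != len(reference):
        raise ValueError("estimated and reference must have the same length")
    if len(reference) == 0:
        raise ValueError("at least one sample is required")

    total = 0.0
    for est, ref in zip(estimated, reference):
        if ref == 0:
            # The relative error is undefined for a zero reference value.
            raise ValueError("reference values must be non-zero")
        total += abs((ref - est) / ref)
    return 100.0 * total / len(reference)


# Hypothetical latencies in cycles for two layers (not data from the paper).
simulated_cycles = [100.0, 100.0]
estimated_cycles = [110.0, 90.0]
error_percent = mape(estimated_cycles, simulated_cycles)
```

With these made-up values each estimate is off by 10% of its reference, so the MAPE is about 10%. A lower MAPE against simulation results is what the abstract means by outperforming regression and analytical models.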

Paper Structure

This paper contains 22 sections, 27 equations, 18 figures, 7 tables, 1 algorithm.

Figures (18)

  • Figure 1: Overview of the automatic DNN accelerator performance model generation approach.
  • Figure 2: Abstract Computer Architecture Description Language class diagram.
  • Figure 3: Block diagram and example instructions for a 2$\times$2 systolic array.
  • Figure 4: Classes and ACADL object diagram for a 2$\times$2 systolic array.
  • Figure 5: Block diagram and example instructions for the UltraTrail accelerator [ultratrail2020].
  • ...and 13 more figures