Automatic Generation of Fast and Accurate Performance Models for Deep Neural Network Accelerators
Konstantin Lübeck, Alexander Louis-Ferdinand Jung, Felix Wedlich, Mika Markus Müller, Federico Nicolás Peccia, Felix Thömmes, Jannik Steinmetz, Valentin Biermaier, Adrian Frischknecht, Paul Palomero Bernardo, Oliver Bringmann
TL;DR
This work tackles the problem of accurately estimating DNN latency on edge accelerators across a wide range of architectural variants without resorting to slow RTL simulations. It introduces ACADL, an object-oriented language for multi-abstraction accelerator modeling, and the Architectural Instruction Dependency Graph (AIDG), which captures instruction-level dependencies and resource conflicts for fast latency estimation. By mapping DNNs via TVM onto ACADL models and evaluating the resulting AIDG, the approach achieves higher accuracy than regression and analytical models while running orders of magnitude faster than RTL simulation: in the best case, evaluating only $154$ loop-kernel iterations suffices to estimate the latency of $4.19\times10^{9}$ instructions. The method is demonstrated on four accelerator families (UltraTrail, Gemmini, a parameterizable systolic array, and Plasticine-derived designs) and multiple DNN workloads, enabling rapid hardware-aware NAS and design-space exploration with practical memory footprints. The results suggest that the ACADL/AIDG framework can significantly accelerate hardware/software co-design for edge AI by providing precise, scalable latency estimates early in the design process, with potential for integration into automated NAS loops and RTL-backed validation when needed. Workloads of this scale are handled efficiently at an evaluation granularity of $0.0001\%$, illustrating the approach’s applicability to large-scale DNN workloads.
Abstract
Implementing Deep Neural Networks (DNNs) on resource-constrained edge devices is a challenging task that requires tailored hardware accelerator architectures and a clear understanding of their performance characteristics when executing the intended AI workload. To facilitate this, we present an approach for automatically generating fast performance models that accurately estimate the latency of a DNN mapped onto systematically modeled and concisely described accelerator architectures. Using our accelerator architecture description method, we modeled representative DNN accelerators: Gemmini, UltraTrail, a Plasticine-derived architecture, and a parameterizable systolic array. Together with DNN mappings for those modeled architectures, we perform a combined DNN/hardware dependency graph analysis, which enables us, in the best case, to evaluate only 154 loop kernel iterations to estimate the performance for 4.19 billion instructions, achieving a significant speedup. Measured against simulation results, we outperform regression and analytical models in terms of mean absolute percentage error (MAPE), while being several orders of magnitude faster than an RTL simulation.
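For readers unfamiliar with the accuracy metric, the following is a minimal sketch of how MAPE is commonly computed, taking simulation results as the reference. The function name `mape` and the latency values are illustrative assumptions; they are not taken from the paper or its tooling.

```python
def mape(reference, estimated):
    """Mean absolute percentage error of estimates against reference values, in percent."""
    if len(reference) != len(estimated) or not reference:
        raise ValueError("inputs must be non-empty and of equal length")
    # Average the relative deviations |(r - e) / r| and scale to percent.
    total = sum(abs((r - e) / r) for r, e in zip(reference, estimated))
    return 100.0 * total / len(reference)


# Hypothetical latencies in cycles (illustrative only, not results from the paper).
simulated = [1000.0, 2000.0, 4000.0]   # reference values, e.g. from a simulation
predicted = [1100.0, 1900.0, 4000.0]   # values from a performance model
error_percent = mape(simulated, predicted)
print(f"MAPE: {error_percent:.2f}%")
```

A lower MAPE means the performance model tracks the reference latencies more closely. Because each deviation is divided by its reference value, the metric is undefined when a reference latency is zero.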
