Integration of a systolic array based hardware accelerator into a DNN operator auto-tuning framework
F. N. Peccia, O. Bringmann
TL;DR
This paper addresses the challenge of deploying DNNs on heterogeneous SoCs with accelerators by enabling end-to-end hardware-aware auto-tuning. It proposes a generic GEMM schedule for a $DIM \times DIM$ systolic-array accelerator and integrates it into TVM via a Gemmini backend, using AutoTVM with an XGB tuner to search the schedule space. In addition, the quantization handling folds the remaining constant terms into the bias and applies requantization, so that a fully quantized GEMM runs on the accelerator; the operation count is defined as $OP = 2 \times M \times N \times K + M \times N$. Experiments on a Xilinx ZCU102 FPGA at 100 MHz reach up to 46 GOPs and show improvements over prior work and over Gemmini's hand-tuned kernels on Baidu DeepBench workloads, supporting the utility of end-to-end hardware-aware auto-tuning for DNN operators.
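To make the operation count concrete, the following minimal Python sketch evaluates $OP = 2 \times M \times N \times K + M \times N$ and converts a measured runtime into GOPs. The function names and the 512-sized example are hypothetical and are not taken from the paper. Reading the extra $M \times N$ term as one bias addition per output element is an interpretation, not something the summary states.

```python
def gemm_op_count(m: int, n: int, k: int) -> int:
    """Operation count OP = 2*M*N*K + M*N for an (M x K) by (K x N) GEMM.

    2*M*N*K counts one multiply and one add per multiply-accumulate.
    The M*N term is assumed here to be one bias addition per output element.
    """
    return 2 * m * n * k + m * n


def throughput_gops(op_count: int, seconds: float) -> float:
    """Throughput in giga-operations per second for a measured runtime."""
    if seconds <= 0:
        raise ValueError("seconds must be positive")
    return op_count / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical GEMM size, chosen only for illustration.
    ops = gemm_op_count(512, 512, 512)
    print(ops)  # 268697600

    # A kernel that executes `ops` operations in `t` seconds achieves
    # throughput_gops(ops, t) GOPs; t must come from a real measurement.
```

Under this definition, a reported throughput depends only on the GEMM dimensions and the measured runtime, which makes results for different schedules of the same workload directly comparable.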
Abstract
The deployment of neural networks on heterogeneous SoCs coupled with custom accelerators is a challenging task because of the lack of end-to-end software tools provided for these systems. Moreover, the low-level schedules and mapping strategies that accelerator developers already provide for typical tensor operations are not necessarily the best possible ones for each particular use case. This is why frameworks that automatically test the performance of the generated code on a specific hardware configuration are of special interest. In this work, the integration between the code generation framework TVM and the systolic-array-based accelerator Gemmini is presented. A generic schedule to offload the GEneral Matrix Multiply (GEMM) tensor operation onto Gemmini is detailed, and its suitability is tested by executing the AutoTVM tuning process on it. Our generated code achieves a peak throughput of 46 giga-operations per second (GOPs) under a 100 MHz clock on a Xilinx ZCU102 FPGA, outperforming previous work. Furthermore, the code generated by this integration was able to surpass the default hand-tuned schedules provided by the Gemmini developers in real-world workloads.
