Integration of a systolic array based hardware accelerator into a DNN operator auto-tuning framework

F. N. Peccia; O. Bringmann

TL;DR

This paper addresses the challenge of deploying DNNs on heterogeneous SoCs with accelerators by enabling end-to-end hardware-aware auto-tuning. It proposes a generic GEMM schedule for a $DIM\times DIM$ systolic-array accelerator and integrates it into TVM via a Gemmini backend, using AutoTVM with an XGB tuner to search the schedule space. In addition, quantization management folds the remaining constants into the bias and applies requantization to accelerate fully quantized GEMM on hardware, with the operation count defined as $OP = 2\times M\times N\times K + M\times N$. Experiments on a Xilinx ZCU102 FPGA at 100 MHz show up to 46 GOPs and demonstrate improvements over prior work and Gemmini hand-tuned kernels on Baidu DeepBench workloads, validating the utility of end-to-end hardware-aware auto-tuning for DNN operators.
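As a rough illustration of how the operation count above translates into a throughput figure, the following sketch evaluates $OP = 2\times M\times N\times K + M\times N$ and divides it by a runtime. The function names, the assumption that $A$ is $M\times K$ and $B$ is $K\times N$, and the runtime value are all hypothetical; the runtime is not a measurement from the paper. Reading the $2\times M\times N\times K$ term as one multiply plus one add per accumulation step and the $M\times N$ term as the bias addition is an interpretation consistent with $C = A\times B + D$, not something the summary states explicitly.

```python
def gemm_op_count(m: int, n: int, k: int) -> int:
    """Operation count OP = 2*M*N*K + M*N for a GEMM with a bias term.

    Assumed reading: 2*M*N*K covers the multiply and the add of every
    accumulation step, and M*N covers adding the bias matrix D.
    """
    return 2 * m * n * k + m * n


def throughput_gops(m: int, n: int, k: int, runtime_s: float) -> float:
    """Giga-operations per second for one GEMM that took runtime_s seconds."""
    if runtime_s <= 0:
        raise ValueError("runtime_s must be positive")
    return gemm_op_count(m, n, k) / runtime_s / 1e9


if __name__ == "__main__":
    size = 512                  # hypothetical square workload, M = N = K
    runtime_s = 0.01            # hypothetical runtime, NOT a result from the paper
    print(gemm_op_count(size, size, size))              # 268697600
    print(throughput_gops(size, size, size, runtime_s))
```

Because the $M\times N$ term grows quadratically while the $2\times M\times N\times K$ term grows cubically, the bias addition contributes very little to the count for large square workloads.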

Abstract

The deployment of neural networks on heterogeneous SoCs coupled with custom accelerators is a challenging task because of the lack of end-to-end software tools provided for these systems. Moreover, the low-level schedules and mapping strategies that accelerator developers already provide for typical tensor operations are not necessarily the best possible ones for each particular use case. This is why frameworks that automatically test the performance of the generated code on a specific hardware configuration are of special interest. In this work, the integration between the code generation framework TVM and the systolic array-based accelerator Gemmini is presented. A generic schedule to offload the GEneral Matrix Multiply (GEMM) tensor operation onto Gemmini is detailed, and its suitability is tested by executing the AutoTVM tuning process on it. Our generated code achieves a peak throughput of 46 giga-operations per second (GOPs) under a 100 MHz clock on a Xilinx ZCU102 FPGA, outperforming previous work. Furthermore, the code generated by this integration surpasses the default hand-tuned schedules provided by the Gemmini developers in real-world workloads.
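To make the idea of offloading a GEMM onto a systolic array more concrete, here is a minimal NumPy sketch of a tiled $C = A\times B + D$ in which every inner step works on tiles of at most $DIM\times DIM$ elements. It is not the schedule proposed in the paper: the function name `tiled_gemm`, the tile size of 16, the loop order, and the int8 inputs with an int32 accumulator are all assumptions made for illustration. An auto-tuner would explore choices such as tile grouping and loop order rather than fix them as done here.

```python
import numpy as np

DIM = 16  # hypothetical systolic array dimension (assumption)


def tiled_gemm(a: np.ndarray, b: np.ndarray, d: np.ndarray, dim: int = DIM) -> np.ndarray:
    """Compute C = A @ B + D tile by tile, with tiles of at most dim x dim."""
    m, k = a.shape
    k_b, n = b.shape
    if k != k_b or d.shape != (m, n):
        raise ValueError("shape mismatch between A, B and D")

    c = np.zeros((m, n), dtype=np.int32)
    for i in range(0, m, dim):            # rows of the output tile
        for j in range(0, n, dim):        # columns of the output tile
            # Start the accumulator from the bias tile (astype returns a copy).
            acc = d[i:i + dim, j:j + dim].astype(np.int32)
            for p in range(0, k, dim):    # reduction dimension
                a_tile = a[i:i + dim, p:p + dim].astype(np.int32)
                b_tile = b[p:p + dim, j:j + dim].astype(np.int32)
                # One tile-sized multiply-accumulate, standing in for the
                # work a dim x dim systolic array would perform.
                acc += a_tile @ b_tile
            c[i:i + dim, j:j + dim] = acc  # write the finished tile back
    return c


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    a = rng.integers(-128, 127, size=(40, 24), dtype=np.int8)
    b = rng.integers(-128, 127, size=(24, 50), dtype=np.int8)
    d = rng.integers(-1000, 1000, size=(40, 50), dtype=np.int32)
    reference = a.astype(np.int32) @ b.astype(np.int32) + d
    print(np.array_equal(tiled_gemm(a, b, d), reference))
```

NumPy slicing clips at array bounds, so matrix sizes that are not multiples of `dim` simply produce smaller edge tiles; real hardware would typically need padding or explicit edge handling instead.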

Paper Structure (6 sections, 2 equations, 4 figures, 1 table)

This paper contains 6 sections, 2 equations, 4 figures, 1 table.

Introduction
Scheduling a GEMM operation on a systolic array
Integrating Gemmini into TVM
Quantization management
Experiments
Conclusions

Figures (4)

Figure 1: (a) shows the proposed GEMM schedule parameters for an accelerator based on a $DIM\times DIM$ systolic array able to execute a generic GEMM of the form $C = A\times B + D$. (b) shows an example of generated pseudocode for the operation, and (c) shows a graphical representation of the movement of data into and out of the accelerator.
Figure 2: Integration workflow example for a neural network formed by a fully connected layer and a softmax layer
Figure 3: Results across different GEMM workloads. For each workload, $M=N=K=\text{workload size}$.
Figure 4: Best schedules found by AutoTVM for the Baidu DeepBench dataset using our implementation.
