Kratos: An FPGA Benchmark for Unrolled DNNs with Fine-Grained Sparsity and Mixed Precision

Xilai Dai; Yuzong Chen; Mohamed S. Abdelfattah

TL;DR

This work introduces Kratos, a specialized FPGA benchmark suite for unrolled DNN primitives that exploit fine-grained sparsity and mixed precision. It provides eight kernels (GEMM and convolution variants) with embedded weights, multiple input unrolling modes, and dual GEMM datapaths, all designed to enable architecture exploration on FPGA fabrics via both Quartus and the VTR flow. The study demonstrates that unrolled DNNs can operate at high frequencies on Arria 10 and achieve substantial area savings through sparsity and low bit-width, corroborated by a case study showing ~2× area reduction when tuning LUT sizes. By delivering automated generation, functional verification, and open-source tooling, Kratos aims to catalyze the design of domain-specific programmable architectures tailored for sparse and low-precision unrolled DNNs, advancing beyond conventional dense DNN accelerators.

Abstract

FPGAs offer a flexible platform for accelerating deep neural network (DNN) inference, particularly for non-uniform workloads featuring fine-grained unstructured sparsity and mixed arithmetic precision. To leverage these redundancies, an emerging approach involves partially or fully unrolling computations for each DNN layer. That way, parameter-level and bit-level ineffectual operations can be completely skipped, thus saving the associated area and power. However, unrolled implementations scale poorly, which limits the size of a DNN that can be unrolled on an FPGA. This motivates the investigation of new reconfigurable architectures to improve the efficiency of unrolled DNNs, while taking advantage of sparsity and mixed precision. To enable this, we present Kratos: a focused FPGA benchmark of unrolled DNN primitives with varying levels of sparsity and different arithmetic precisions. Our analysis reveals that unrolled DNNs can operate at very high frequencies, reaching the maximum frequency limit of an Arria 10 device. Additionally, we found that substantial area reductions can be achieved through fine-grained sparsity and low bit-width. We build on those results to tailor the FPGA fabric for unrolled DNNs through an architectural case study demonstrating $\sim$2$\times$ area reduction when using smaller LUT sizes within current FPGAs. This paves the way for further exploration of new programmable architectures that are purpose-built for sparse and low-precision unrolled DNNs. Our source code and benchmark are available at github.com/abdelfattah-lab/Kratos-benchmark.
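
The idea that unrolling lets zero weights and zero bits be skipped entirely can be illustrated with a small software sketch. The Python below is not the Kratos generator and is not an area model. It assumes a matrix-vector product with constant integer weights, and every function name in it is hypothetical. It only shows that when weights are known ahead of time, one term is emitted per nonzero weight, so pruned weights cost nothing, and that the number of set bits in each remaining weight gives a rough proxy for bit-level work.

```python
import random


def make_sparse_weights(rows, cols, sparsity, bits, seed=0):
    """Hypothetical helper: signed integer weights, about `sparsity` of them zero."""
    rng = random.Random(seed)
    lo, hi = -(1 << (bits - 1)), (1 << (bits - 1)) - 1
    weights = []
    for _ in range(rows):
        row = []
        for _ in range(cols):
            if rng.random() < sparsity:
                row.append(0)  # pruned weight
            else:
                w = 0
                while w == 0:  # redraw until the kept weight is nonzero
                    w = rng.randint(lo, hi)
                row.append(w)
        weights.append(row)
    return weights


def unroll(weights):
    """Emit one (output_index, input_index, weight) term per nonzero weight."""
    return [
        (i, j, w)
        for i, row in enumerate(weights)
        for j, w in enumerate(row)
        if w != 0
    ]


def run_unrolled(terms, x, n_outputs):
    """Evaluate only the terms that survived unrolling."""
    y = [0] * n_outputs
    for i, j, w in terms:
        y[i] += w * x[j]
    return y


def dense_reference(weights, x):
    """Plain dense matrix-vector product, used to check the unrolled result."""
    return [sum(w * xj for w, xj in zip(row, x)) for row in weights]


def set_bit_count(terms):
    """Rough proxy for bit-level work: set bits in each weight magnitude."""
    return sum(bin(abs(w)).count("1") for _, _, w in terms)


if __name__ == "__main__":
    W = make_sparse_weights(64, 64, sparsity=0.9, bits=4)
    terms = unroll(W)
    x = list(range(64))
    assert run_unrolled(terms, x, 64) == dense_reference(W, x)
    print("dense multiplies:", 64 * 64)
    print("unrolled multiplies:", len(terms))
    print("set weight bits:", set_bit_count(terms))
```

The sketch counts operations rather than ALMs; how those counts translate to FPGA area depends on the synthesis flow and target fabric, which is what the benchmark itself measures.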

Paper Structure (14 sections, 8 figures, 3 tables)

This paper contains 14 sections, 8 figures, 3 tables.

Introduction
Related Work
Benchmark Description
Kernels
Input Unrolling Factors
CAD for FPGA Architecture Exploration
Benchmark Workflow
Evaluation Methodology
Experimental Setup
Design Space
Experimental Results
Area and Frequency Trends on Arria 10
Architectural Exploration Case Study
Conclusions and Future Work

Figures (8)

Figure 1: Diagram of unrolled DNNs and the area of a 64×64 matrix multiplication on an FPGA. Naïve unrolling quickly utilizes most of the FPGA area (63%), but specialization, pruning, and quantization reduce area by 600$\times$, down to just 0.1% of the FPGA for 4096 effective FLOPs.
Figure 2: Dataflow of (a) GEMM and (b) convolution for different input unrolling factors: pixelwise, row-parallel, and fully-unrolled. The weight/filter is always fully unrolled.
Figure 3: Hardware implementation of GEMM: (a) multiply-adder tree and (b) weight-stationary systolic array.
Figure 4: Logic block diagram of the baseline FPGA for VTR architectural exploration.
Figure 5: Normalized ALM utilization on Arria 10 vs. sparsity for (a) GEMM, (b) conv1d, and (c) conv2d kernels. The solid black line highlights the ideal trend where the ALM utilization linearly decreases with higher sparsity.
...and 3 more figures
