Revealing Untapped DSP Optimization Potentials for FPGA-Based Systolic Matrix Engines

Jindong Li; Tenglong Li; Guobin Shen; Dongcheng Zhao; Qian Zhang; Yi Zeng

TL;DR

This paper unveils several previously untapped DSP optimization techniques capable of further enhancing FPGA-based systolic matrix engines and applies them to two well-known systolic architectures: Google TPUv1 and Xilinx Vitis AI DPU.

Abstract

Systolic architectures are widely embraced by neural network accelerators for their superior performance in highly parallelized computation. The DSP48E2s serve as dedicated arithmetic blocks in Xilinx Ultrascale series FPGAs and constitute a fundamental component in FPGA-based systolic matrix engines. Harnessing the full potential of DSP48E2s in architectural design can result in significant performance enhancements for systolic architectures on Ultrascale series FPGAs. This paper unveils several previously untapped DSP optimization techniques capable of further enhancing FPGA-based systolic matrix engines. We apply these techniques to two well-known systolic architectures: Google TPUv1 and Xilinx Vitis AI DPU. With the proposed techniques, our design achieves substantial resource and power reduction compared to the open-source TPUv1 FPGA implementation and the Vitis AI DPU implementation in the same parallelism setting. We also demonstrate the applicability of our techniques to neuromorphic hardware for supporting spiking neural network acceleration.

Paper Structure (15 sections, 8 figures, 3 tables)

This paper contains 15 sections, 8 figures, 3 tables.

Introduction
Related Works
DSP48E2 Overview
Enhancing Systolic Engine of TPUv1 on FPGA
Drawbacks in Existing Implementations
Enhancement: In-DSP Operand Prefetching
Experiments
Enhancing Systolic Engine of Xilinx DPU
Drawbacks in DPUCZDX8G's Systolic Engine
Enhancement: In-DSP Multiplexing
Enhancement: Ring Accumulator
Experiments
Applicability on SNN Accelerator
Conclusion
Acknowledgement

Figures (8)

Figure 1: DSP48E2 Overview. Green: four wide input ports. Purple: two flexible input pipelines. Orange: four dynamic multiplexers. Blue: three cascade paths.
Figure 2: A) Google TPUv1-like systolic matrix engine. B) A single PE column of the proposed TPUv1-like systolic engine. The blue color in B indicates the weight prefetching path that is completely absorbed into the DSP48E2.
Figure 3: The proposed in-DSP operand prefetching technique. The cascaded $B_1$ registers form the shared weight prefetching path. When data stored in the $B_2$ registers expires, the data in the $B_1$ registers shifts into $B_2$.
Figure 4: A) B1024 systolic engine of the DPUCZDX8G. B) The PE of DPUCZDX8G's systolic engine. C) Proposed enhanced systolic matrix engine. D) The PE of the proposed enhanced systolic engine. The blue color indicates the $Clk_{\times 1}$ clock domain, while the purple color indicates the $Clk_{\times 2}$ clock domain.
Figure 5: The proposed in-DSP multiplexing technique.
...and 3 more figures
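
The register behavior described in the Figure 3 caption can be illustrated with a small behavioral model. This is only a sketch under stated assumptions, not the authors' RTL: the class name `PEColumn` and its methods are hypothetical, each PE is reduced to one $B_1$ (prefetch) and one $B_2$ (active) register, and the column output is computed as a plain dot product that ignores pipeline timing and the DSP48E2 cascade details.

```python
class PEColumn:
    """Toy model of one PE column (hypothetical names, not the paper's RTL).

    Each PE holds a B1 register (prefetch path) and a B2 register (active weight).
    """

    def __init__(self, n):
        self.b1 = [0] * n  # chained B1 registers: the shared prefetch path
        self.b2 = [0] * n  # B2 registers: weights currently used for multiplication

    def shift_in(self, weight):
        # The B1 registers are cascaded: a new weight enters the first PE
        # and every previously loaded weight moves one PE further along.
        self.b1 = [weight] + self.b1[:-1]

    def swap(self):
        # When the weights in B2 expire, each PE copies its B1 value into B2.
        self.b2 = list(self.b1)

    def mac(self, activations):
        # Simplification: the column result is the dot product of the
        # activations with the active (B2) weights.
        return sum(a * w for a, w in zip(activations, self.b2))


col = PEColumn(3)
for w in (3, 2, 1):          # prefetch the first weight set through B1
    col.shift_in(w)
col.swap()                   # activate it
for w in (6, 5, 4):          # prefetch the next set while the first is in use
    col.shift_in(w)
first = col.mac([1, 1, 1])   # still computed with the first weight set
col.swap()
second = col.mac([1, 1, 1])  # now computed with the second weight set
```

The point of the model is that shifting new weights through `b1` never disturbs `b2`, so computation with the current weights can overlap with loading the next ones.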
