A Configurable Mixed-Precision Fused Dot Product Unit for GPGPU Tensor Computation

Nikhil Rout, Blaise Tine

TL;DR

The paper addresses the need for high-throughput mixed-precision dot products in open-source GPGPU RTL by integrating floating-point and integer pipelines into a single fused 4-stage unit. The authors present a unified datapath with low-precision multipliers (FP16/BF16/FP8/BF8/INT8/UINT4) and FP32/INT32 accumulation, plus exponent alignment and MOD-4 CSA-based accumulation, all fused to avoid arbitration. Experimental results show a 4-cycle latency at a 306.6 MHz clock, with substantial throughput improvements over discrete HardFloat and Xilinx DSP backends, along with significant area and register reductions. The work enables scalable, configurable mixed-precision tensor computation in an open-source GPGPU core and points to future extensions such as sparseFEDP and MX formats.

Abstract

Efficient mixed-precision MMA operations are critical for accelerating Deep Learning workloads on GPGPUs. However, existing open-source RTL implementations of inner dot products rely on discrete arithmetic units, leading to suboptimal throughput and poor resource utilization. To address these challenges, we propose a scalable mixed-precision dot product unit that integrates floating-point and integer arithmetic pipelines within a single fused architecture, implemented as part of the Tensor Core Unit extension of the open-source RISC-V-based Vortex GPGPU. Our design supports low-precision multiplication in FP16/BF16/FP8/BF8/INT8/UINT4 formats and higher-precision accumulation in FP32/INT32, with an extensible framework for adding and evaluating other custom representations in the future. Experimental results demonstrate 4-cycle operation latency at a 306.6 MHz clock frequency on the AMD Xilinx Alveo U55C FPGA, delivering an ideal filled-pipeline throughput of 9.812 GFLOPS in a 4-thread-per-warp configuration.
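
The multiply-in-low-precision, accumulate-in-high-precision pattern described in the abstract can be illustrated with a small software model. The following is a minimal sketch for the FP16-input, FP32-accumulate case only, not the authors' RTL. The function names (`to_fp16`, `to_fp32`, `fused_dot_fp16_fp32`) are hypothetical. The choice to sum the products in double precision and round once at the end is an assumption made for illustration, since the abstract does not specify the unit's rounding behavior.

```python
import struct


def to_fp16(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary16 value."""
    return struct.unpack("<e", struct.pack("<e", x))[0]


def to_fp32(x: float) -> float:
    """Round a Python float to the nearest IEEE 754 binary32 value."""
    return struct.unpack("<f", struct.pack("<f", x))[0]


def fused_dot_fp16_fp32(a, b, c):
    """Software model of c + sum(a[i] * b[i]).

    Inputs a and b are quantized to FP16. Each product of two FP16
    values is exact in a Python float (binary64). The running sum is
    kept in binary64 and rounded to FP32 once at the end, which is an
    assumed stand-in for a fused datapath, not the paper's exact scheme.
    """
    acc = to_fp32(c)
    for x, y in zip(a, b):
        acc += to_fp16(x) * to_fp16(y)
    return to_fp32(acc)


# Four products accumulated onto an FP32 addend.
result = fused_dot_fp16_fp32([1.0, 2.0, 0.5, 4.0], [1.0, 0.5, 2.0, 0.25], 1.0)
print(result)
```

A discrete implementation would instead round after every multiply and every add, which is where a fused model of this kind can differ in the low-order bits.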

Paper Structure

This paper contains 10 sections, 1 equation, 2 figures, 1 table.

Figures (2)

  • Figure 1: GPGPU Mixed-Precision Fused Dot Product Unit 4-Stage Pipeline Microarchitecture
  • Figure 2: FEDP Backends Performance Scaling (FP16/BF16)