MiniFloat-NN and ExSdotp: An ISA Extension and a Modular Open Hardware Unit for Low-Precision Training on RISC-V cores

Luca Bertaccini; Gianna Paulin; Tim Fischer; Stefan Mach; Luca Benini

TL;DR

The paper presents MiniFloat-NN, a RISC-V instruction set architecture extension for low-precision NN training that supports two 8-bit and two 16-bit FP formats as well as expanding operations. An accompanying ExSdotp unit implements both instruction types of the extension (sum-of-dot-products and three-term additions) efficiently in hardware.

Abstract

Low-precision formats have recently driven major breakthroughs in neural network (NN) training and inference by reducing the memory footprint of the NN models and improving the energy efficiency of the underlying hardware architectures. Narrow integer data types have been extensively investigated for NN inference and have successfully been pushed to the extreme of ternary and binary representations. In contrast, most training-oriented platforms use at least 16-bit floating-point (FP) formats. Lower-precision data types such as 8-bit FP formats and mixed-precision techniques have only recently been explored in hardware implementations. We present MiniFloat-NN, a RISC-V instruction set architecture extension for low-precision NN training, providing support for two 8-bit and two 16-bit FP formats and expanding operations. The extension includes sum-of-dot-product instructions that accumulate the result in a larger format and three-term additions in two variations: expanding and non-expanding. We implement an ExSdotp unit to efficiently support both instruction types in hardware. The fused nature of the ExSdotp module prevents precision losses generated by the non-associativity of two consecutive FP additions while saving around 30% of the area and critical path compared to a cascade of two expanding fused multiply-add units. We replicate the ExSdotp module in a SIMD wrapper and integrate it into an open-source floating-point unit, which, coupled to an open-source RISC-V core, lays the foundation for future scalable architectures targeting low-precision and mixed-precision NN training. A cluster containing eight extended cores sharing a scratchpad memory, implemented in 12 nm FinFET technology, achieves up to 575 GFLOPS/W when computing FP8-to-FP16 GEMMs at 0.8 V, 1.26 GHz.
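The precision argument for a fused unit can be illustrated with a small software model. The sketch below is not the authors' hardware or code: it assumes FP16 inputs and an FP32 accumulator as one possible expanding pair, ignores overflow, and uses hypothetical helper names (`to_fp16`, `round_to_fp32`, `exsdotp_fused`, `exsdotp_cascade`). It compares a single rounding of a*b + c*d + e against two chained expanding FMAs that round the intermediate sum.

```python
from fractions import Fraction
import struct


def to_fp16(x: float) -> Fraction:
    """Round a Python float to IEEE binary16 and return the exact value."""
    return Fraction(struct.unpack("<e", struct.pack("<e", x))[0])


def round_to_fp32(x: Fraction) -> Fraction:
    """Round an exact value to IEEE binary32 (nearest-even, overflow ignored)."""
    if x == 0:
        return Fraction(0)
    n, d = abs(x.numerator), x.denominator
    e = n.bit_length() - d.bit_length()    # floor(log2|x|) is e or e - 1
    if Fraction(n, d) < Fraction(2) ** e:
        e -= 1
    e = max(e, -126)                       # subnormals share the minimum exponent
    ulp = Fraction(2) ** (e - 23)          # spacing of binary32 values near x
    return round(x / ulp) * ulp            # round() on a Fraction rounds ties to even


def exsdotp_fused(a, b, c, d, e):
    """a*b + c*d + e evaluated exactly, then rounded once to the wider format."""
    return round_to_fp32(a * b + c * d + e)


def exsdotp_cascade(a, b, c, d, e):
    """Two chained expanding FMAs: the inner sum c*d + e is rounded first."""
    return round_to_fp32(a * b + round_to_fp32(c * d + e))


# Illustrative operands (an assumption, chosen so that c*d is below half an
# FP32 ulp of e and is therefore lost when the inner sum is rounded).
a, b = to_fp16(-1.0), to_fp16(1.0)
c, d = to_fp16(2.0 ** -13), to_fp16(2.0 ** -13)
e = Fraction(1)

fused = exsdotp_fused(a, b, c, d, e)
cascade = exsdotp_cascade(a, b, c, d, e)
print("fused  :", float(fused))
print("cascade:", float(cascade))
```

With these operands the cascade cancels to zero because the small product was already rounded away, while the single-rounding model keeps it. The example only models rounding behavior; it says nothing about the area or timing figures reported above.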

Paper Structure (17 sections, 6 equations, 10 figures, 4 tables)

Introduction
Related Work
Floating-Point Formats for NN Training
Related Architectures
Architecture
Supported FP Formats
ExSdotp Unit
ExVsum and Vsum on the ExSdotp Datapath
SIMD Wrapper and Integration into FPnew
MiniFloat-NN PE
Experimental Results
Area and Timing
Performance
Power and Energy Efficiency
Accuracy
...and 2 more sections

Figures (10)

Figure 1: Relevant floating-point formats in the context of NN training. The exponent (green) and mantissa (blue) bitwidths are reported for each data type.
Figure 2: Register file utilization: ExFMA vs. ExSdotp. With ExFMA, only half of each of the two source registers can be processed per cycle, whereas ExSdotp fully exploits the data stored in the register file that can be passed to the FPU interface.
Figure 3: ExSdotp instruction computed by two ExFMA units vs. one dedicated ExSdotp unit. Note that the first solution actually computes $a*b + (c*d + e)$, which is not necessarily equal to $a*b + c*d + e$ when using FP arithmetic.
Figure 4: ExSdotp data flow: examples of how the information is packed at each stage of the datapath are provided in the blue boxes (note that we do not include the exponent datapath in this figure, nor the sticky bits that are generated after the various shifts). The ExSdotp operation takes four $w$-bit inputs and a $2w$-bit accumulator, while the Vsum takes three $2w$-bit inputs to produce a $2w$-bit output.
Figure 5: Block diagram of the extended FPU, with a zoom on the ExSdotp SIMD module.
...and 5 more figures
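
The register-utilization point in the Figure 2 caption can be checked with simple counting. The sketch below is a back-of-the-envelope model, not taken from the paper: the 64-bit register width, the 8-bit source width, and the function name `source_utilization` are assumptions made for illustration. It counts how many source elements one SIMD expanding instruction can consume when the destination register must hold results that are twice as wide.

```python
def source_utilization(reg_bits: int, src_bits: int, products_per_result: int) -> float:
    """Fraction of each source register consumed by one SIMD expanding instruction."""
    dst_bits = 2 * src_bits                    # expanding: results are twice as wide
    results = reg_bits // dst_bits             # results that fit in the destination register
    consumed = results * products_per_result   # elements read from each source register
    available = reg_bits // src_bits           # elements stored in each source register
    return consumed / available


REG_BITS = 64  # assumed register width, for illustration only

# ExFMA: one product per result, so each result reads one element per source register.
exfma = source_utilization(REG_BITS, 8, products_per_result=1)    # 0.5
# ExSdotp: two products per result, so each result reads two elements per source register.
exsdotp = source_utilization(REG_BITS, 8, products_per_result=2)  # 1.0

print(f"ExFMA   source utilization: {exfma:.0%}")
print(f"ExSdotp source utilization: {exsdotp:.0%}")
```

The ratio does not depend on the assumed register width: the destination always holds half as many elements as a source register, so one product per result leaves half of the source elements unused.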
