Floating-Point Multiply-Add with Approximate Normalization for Low-Cost Matrix Engines

Kosmas Alexandridis, Christodoulos Peltekis, Dionysios Filippas, Giorgos Dimitrakopoulos

TL;DR

The paper tackles the hardware cost of floating-point normalization in matrix engines used for transformer workloads. It introduces approximate normalization within FP multiply-add units, controlled by a few small shift-amount parameters, to reduce area and power while maintaining accuracy on practical ML tasks. Empirical results show substantial hardware savings (approximately 14–19% area and 10–14% power at 28 nm and 1 GHz), with average transformer accuracy losses of around 1% for favorable configurations and up to 7.2% for less favorable ones. With a well-chosen configuration, this approach enables energy-efficient, high-throughput FP matrix engines suitable for low-cost ML accelerators at a small cost in model accuracy.

Abstract

The widespread adoption of machine learning algorithms necessitates hardware acceleration to ensure efficient performance. This acceleration relies on custom matrix engines that operate on full or reduced-precision floating-point arithmetic. However, conventional floating-point implementations can be power hungry. This paper proposes a method to improve the energy efficiency of the matrix engines used in machine learning algorithm acceleration. Our approach leverages approximate normalization within the floating-point multiply-add units as a means to reduce their hardware complexity, without sacrificing overall machine-learning model accuracy. Hardware synthesis results show that this technique reduces area and power consumption roughly by 16% and 13% on average for Bfloat16 format. Also, the error introduced in transformer model accuracy is 1% on average, for the most efficient configuration of the proposed approach.

Paper Structure (9 sections, 7 figures, 1 table)

This paper contains 9 sections, 7 figures, 1 table.

Figures (7)

  • Figure 1: Floating point formats: Single and reduced precision.
  • Figure 2: a) The organization of systolic arrays and b) the corresponding weight-stationary dataflow.
  • Figure 3: The pipelined structure of a fused multiply-add PE for reduced-precision Bfloat16 numbers. The PE uses accurate normalization logic, which comprises Leading-Zero Anticipation and counting (LZA), the normalization shifter, and the sign and exponent correction logic. The significands of $A$ and $B$ are 8 bits wide (7 mantissa bits plus one hidden bit). The significands of the partial sum $C$ and of the output use double that width, i.e., 16 bits.
  • Figure 4: The area contribution of each major component of the floating-point PE of Fig. 3. Inputs and outputs follow the Bfloat16 representation. However, the partial sum $C$ and the output of the PE use 16-bit significands to preserve precision in the addition step executed in each column of the SA. Flip-flops (FFs) refer to the area contribution of the pipeline registers.
  • Figure 5: The logic that implements approximate normalization. The result of the addition ('sum') is either not shifted, or shifted by $k$ or $k+\lambda$ positions.
  • ...and 2 more figures
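
The Figure 5 caption says that approximate normalization applies one of only three shifts to the adder result: none, $k$, or $k+\lambda$ positions. The Python sketch below is a minimal software model of that idea, not the authors' circuit. The function names are hypothetical, the 16-bit width is taken from the Figure 3 caption, and the rule for picking among the three shifts (use the largest one that does not shift a set bit out of the word) is an assumption made for illustration.

```python
from typing import Tuple

WIDTH = 16  # significand width of the partial sum, per the Figure 3 caption


def leading_zeros(value: int, width: int = WIDTH) -> int:
    """Count leading zeros of `value` viewed as an unsigned `width`-bit word."""
    return width - value.bit_length()


def exact_normalize(total: int, width: int = WIDTH) -> Tuple[int, int]:
    """Reference behaviour: shift left until the most significant bit is 1."""
    if total == 0:
        return 0, 0
    shift = leading_zeros(total, width)
    return (total << shift) & ((1 << width) - 1), shift


def approx_normalize(total: int, k: int, lam: int,
                     width: int = WIDTH) -> Tuple[int, int]:
    """Shift by 0, k, or k + lam positions only.

    Assumed selection rule (not taken from the paper): pick the largest
    of the three candidate shifts that keeps every set bit in the word.
    """
    lz = leading_zeros(total, width)
    if lz >= k + lam:
        shift = k + lam
    elif lz >= k:
        shift = k
    else:
        shift = 0
    return (total << shift) & ((1 << width) - 1), shift


if __name__ == "__main__":
    # In a real PE the exponent would be decreased by the returned shift.
    for value in (0x0100, 0x2000, 0x4000):
        exact, s_exact = exact_normalize(value)
        approx, s_approx = approx_normalize(value, k=2, lam=3)
        print(f"sum={value:#06x}  exact: {exact:#06x} (shift {s_exact})  "
              f"approx: {approx:#06x} (shift {s_approx})")
```

With $k=2$ and $\lambda=3$, an input with seven leading zeros is shifted by only five positions, so the result keeps two leading zeros that exact normalization would have removed. This illustrates the kind of precision that is traded for the simpler shifter and leading-zero logic.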