Floating-Point Multiply-Add with Approximate Normalization for Low-Cost Matrix Engines
Kosmas Alexandridis, Christodoulos Peltekis, Dionysios Filippas, Giorgos Dimitrakopoulos
TL;DR
The paper tackles the hardware cost of floating-point normalization in the matrix engines used to accelerate transformer workloads. It introduces approximate normalization in the FP multiply-add units, controlled by a few bit-level configuration parameters, to reduce area and power while preserving accuracy on practical ML tasks. Hardware synthesis at 28 nm and 1 GHz shows substantial savings (approximately 14–19% in area and 10–14% in power). The accompanying transformer experiments show an average accuracy loss of around 1% for the favorable configurations and up to 7.2% for less favorable ones. The approach thus enables energy-efficient, high-throughput FP matrix engines for low-cost ML accelerators with little loss in model accuracy.
Abstract
The widespread adoption of machine learning algorithms necessitates hardware acceleration to ensure efficient performance. This acceleration relies on custom matrix engines that operate on full- or reduced-precision floating-point arithmetic. However, conventional floating-point implementations can be power hungry. This paper proposes a method to improve the energy efficiency of the matrix engines used to accelerate machine learning algorithms. Our approach uses approximate normalization within the floating-point multiply-add units to reduce their hardware complexity without sacrificing overall machine-learning model accuracy. Hardware synthesis results show that this technique reduces area and power consumption by roughly 16% and 13%, respectively, on average for the Bfloat16 format. Moreover, for the most efficient configuration of the proposed approach, the average loss in transformer model accuracy is 1%.
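The following minimal Python sketch makes "normalization" concrete. It models a serial dot product in a simplified floating-point format: a signed integer mantissa m, an unbounded exponent e, and the value m * 2**e. After each addition, exact normalization shifts the sum left until its leading one reaches the top of a P-bit field. The hypothetical `max_shift` parameter caps that left shift, which is one simple way to approximate normalization. This is an illustrative assumption, not the authors' design. The paper's actual approximation scheme, its bit parameters, and its Bfloat16 datapath widths are not reproduced here. `P`, `max_shift`, `shr`, `to_fixed`, `add`, and `dot` are all hypothetical names.

```python
import math

P = 16  # hypothetical accumulator mantissa width in bits (not a value from the paper)


def shr(m: int, s: int) -> int:
    """Right-shift a signed mantissa, truncating its magnitude (sign-magnitude style)."""
    return -((-m) >> s) if m < 0 else m >> s


def to_fixed(x: float):
    """Encode x as (m, e) with x ~= m * 2**e and |m| normalized to P bits (truncated)."""
    if x == 0.0:
        return 0, 0
    f, e = math.frexp(x)               # x = f * 2**e with 0.5 <= |f| < 1
    return int(f * (1 << P)), e - P    # int() truncates toward zero


def add(a, b, max_shift=None):
    """Add two (m, e) values, then normalize the sum exactly or approximately."""
    (ma, ea), (mb, eb) = a, b
    if ma == 0:
        return b
    if mb == 0:
        return a
    if ea < eb:                        # make `a` the operand with the larger exponent
        (ma, ea), (mb, eb) = (mb, eb), (ma, ea)
    m, e = ma + shr(mb, ea - eb), ea   # alignment drops the shifted-out bits of `b`
    if abs(m) >= 1 << P:               # carry out of the P-bit field: shift right once
        m, e = shr(m, 1), e + 1
    if m == 0:
        return 0, e
    lz = P - abs(m).bit_length()       # leading zeros of the P-bit magnitude
    if max_shift is not None:
        lz = min(lz, max_shift)        # hypothetical approximation: cap the left shift
    return m << lz, e - lz


def dot(xs, ys, max_shift=None) -> float:
    """Accumulate x*y products serially, as a chain of multiply-add steps would."""
    acc = (0, 0)
    for x, y in zip(xs, ys):
        acc = add(acc, to_fixed(x * y), max_shift)
    m, e = acc
    return math.ldexp(m, e)
```

Consider xs = [1.0, -(1.0 - 2**-10), 2**-20] and ys = [1.0, 1.0, 1.0]. Exact normalization (max_shift=None) and max_shift=8 both return 2**-10 + 2**-20, while max_shift=0 returns only 2**-10. The first two products nearly cancel, so the unnormalized accumulator keeps a large exponent, and aligning the small third product to it shifts all of that product's bits out. In hardware, a smaller cap would presumably allow a narrower normalization shifter and simpler leading-zero logic. The sketch models only the numerical side of that trade-off.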
