A Power-Efficient Hardware Implementation of L-Mul
Ruiqi Chen, Yangxintong Lyu, Han Bao, Bruno da Silva
TL;DR
This work presents a power-efficient FPGA implementation of the L-Mul approximate FP8 multiplier for neural network inference. It uses the LUT and carry-chain primitives of AMD Xilinx UltraScale/UltraScale+ devices to implement the exponent and mantissa adders, with a piecewise L-Mul approximation that replaces multiplications with shifts and additions. The design achieves low resource usage and favorable energy efficiency, outperforms comparable 8-bit designs, and enables DSP-free CNN and GCN accelerators whose accuracy trade-offs are validated on standard datasets. The results demonstrate practical impact for energy-constrained NN workloads and suggest applicability to larger models such as LLMs and diffusion models.
Abstract
Multiplication is a core operation in modern neural network (NN) computations and contributes significantly to their energy consumption. The linear-complexity multiplication (L-Mul) algorithm was proposed as an approximate multiplication method for emerging NN models, such as large language models (LLMs), to reduce the energy consumption and computational complexity of multiplications. However, no hardware implementation of L-Mul has been reported so far. In addition, the 8-bit floating-point (FP8) format offers a wider dynamic range than the traditional 8-bit integer (INT8) format and is increasingly adopted in NN computations. This paper therefore presents a power-efficient FPGA-based hardware implementation of L-Mul in the form of an approximate FP8 multiplier. The core computation is implemented using the dynamically reconfigurable lookup table and carry-chain primitives available in AMD Xilinx UltraScale/UltraScale+ technology. The accuracy and resource utilization of the approximate multiplier are evaluated and analyzed. Furthermore, the approximate FP8 multiplier is deployed in the inference phase of representative NN models to validate its effectiveness.
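To make the idea of an addition-only approximate multiplier concrete, the following is a minimal software sketch of L-Mul, not the hardware design presented in this paper. It assumes the formulation from the original L-Mul proposal, in which the mantissa product is replaced by a constant offset 2^(-l(m)) that depends on the mantissa width m. The function names `lmul` and `offset_exponent` are hypothetical, and the sketch operates on ordinary Python floats rather than packed FP8 bit patterns, so rounding to FP8 and special values such as NaN and infinity are not modeled.

```python
import math


def offset_exponent(mantissa_bits: int) -> int:
    """l(m) as assumed from the original L-Mul proposal (not defined in this section)."""
    if mantissa_bits <= 3:
        return mantissa_bits
    if mantissa_bits == 4:
        return 3
    return 4


def lmul(x: float, y: float, mantissa_bits: int = 3) -> float:
    """Approximate x * y using only additions on the exponent and mantissa fields.

    With x = (1 + fx) * 2**ex and y = (1 + fy) * 2**ey, the exact product is
    (1 + fx + fy + fx*fy) * 2**(ex + ey). L-Mul replaces the fx*fy term with
    the constant 2**(-l(m)), so no mantissa multiplication is required.
    """
    if x == 0.0 or y == 0.0:
        return 0.0
    sign = -1.0 if (x < 0) != (y < 0) else 1.0

    # math.frexp returns (m, e) with 0.5 <= m < 1, so 1 + f = 2 * m and the
    # unbiased exponent is e - 1.
    mx, ex = math.frexp(abs(x))
    my, ey = math.frexp(abs(y))
    fx = 2.0 * mx - 1.0
    fy = 2.0 * my - 1.0

    offset = 2.0 ** -offset_exponent(mantissa_bits)
    mantissa_sum = 1.0 + fx + fy + offset      # additions only
    exponent_sum = (ex - 1) + (ey - 1)         # additions only
    return sign * math.ldexp(mantissa_sum, exponent_sum)


if __name__ == "__main__":
    print(lmul(1.5, 1.25))   # 1.875 (exact product is 1.875)
    print(lmul(2.0, -3.0))   # -6.5  (exact product is -6.0)
```

With a 3-bit mantissa (as in the E4M3 variant of FP8), the offset is 2^(-3), so the approximation is exact when the mantissa fractions multiply to 0.125 and deviates otherwise, as the second call shows.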
