Hardware-Efficient CNNs: Interleaved Approximate FP32 Multipliers for Kernel Computation
Bindu G Gowda, Yogesh Goyal, Yash Gupta, Madhav Rao
TL;DR
This work addresses the hardware cost of FP32 multipliers in CNN inference by designing compressor-based approximate FP32 mantissa multipliers and evaluating eight configurations of a Radix-8 Modified Booth multiplier. It shows that carefully placed approximate multipliers (AMs) can reduce area and power-delay product (PDP) while maintaining high accuracy, and it introduces a fine-grained interleaving of multiplier types across CNN kernels, optimized with NSGA-II. A two-layer CNN trained on CIFAR-10 demonstrates that interleaving AMs can improve inference accuracy and generalization, with a reported 99.2% of outputs within a 1% error tolerance and substantial PDP gains. The study frames this as a double-approximation approach that jointly optimizes hardware efficiency and CNN performance, offering a practical path to hardware-aware, energy-efficient CNN inference.
Abstract
The single-precision floating-point (FP32) data format, defined by the IEEE 754 standard, is widely employed in scientific computing, signal processing, and deep learning training, where precision is critical. However, FP32 multiplication is computationally expensive and requires complex hardware, especially for exact mantissa multiplication. In practical applications such as neural network inference, perfect accuracy is not always necessary: minor multiplication errors often have little impact on final accuracy. This makes it possible to trade precision for gains in area, power, and speed. This work focuses on CNN inference using approximate FP32 multipliers, in which the mantissa multiplication is approximated with error-variant approximate compressors that significantly reduce hardware cost. Furthermore, this work optimizes CNN performance by employing differently approximated FP32 multipliers and studying their impact when interleaved within the kernels across the convolutional layers. The placement and ordering of these approximate multipliers within each kernel are optimized using the Non-dominated Sorting Genetic Algorithm-II (NSGA-II), balancing the trade-off between accuracy and hardware efficiency.
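To make the idea of approximating only the mantissa product concrete, the following is a minimal software sketch, not the multiplier design described in this work. It does not model approximate compressors or the Radix-8 Modified Booth structure. Instead, it swaps in a much simpler stand-in: zeroing the low-order bits of each 24-bit significand before an exact integer multiply. The function names and the `drop_bits` parameter are assumptions introduced purely for illustration, and subnormals, infinities, NaNs, and overflow are not handled.

```python
import math
import struct


def fp32_fields(x: float):
    """Round x to FP32 and return (sign, biased exponent, 23-bit fraction)."""
    bits = struct.unpack("<I", struct.pack("<f", x))[0]
    return bits >> 31, (bits >> 23) & 0xFF, bits & 0x7FFFFF


def approx_fp32_mul(a: float, b: float, drop_bits: int = 8) -> float:
    """Multiply two normal FP32 values with a truncated mantissa product.

    Illustrative stand-in only: the approximation here is operand
    truncation (hypothetical `drop_bits`), not a compressor-based design.
    """
    if not 0 <= drop_bits <= 23:
        raise ValueError("drop_bits must be in [0, 23]")
    sa, ea, fa = fp32_fields(a)
    sb, eb, fb = fp32_fields(b)
    if (ea == 0 and fa == 0) or (eb == 0 and fb == 0):
        return 0.0  # either operand is zero
    if ea in (0, 255) or eb in (0, 255):
        raise ValueError("only normal FP32 operands are supported")

    # Restore the implicit leading 1, then zero the low-order bits.
    mask = ~((1 << drop_bits) - 1)
    ma = ((1 << 23) | fa) & mask
    mb = ((1 << 23) | fb) & mask

    # Sign and exponent are handled exactly; only the mantissa is approximate.
    prod = ma * mb  # integer product of the two truncated significands
    # value = prod / 2**46 * 2**((ea - 127) + (eb - 127))
    magnitude = math.ldexp(prod, ea + eb - 254 - 46)
    result = -magnitude if sa ^ sb else magnitude

    # Round the result back to FP32 (raises OverflowError if out of range).
    return struct.unpack("<f", struct.pack("<f", result))[0]
```

For example, `approx_fp32_mul(1.5, 2.0)` returns `3.0`, because neither significand has any low-order bits to drop. For operands with dense mantissas, the result deviates slightly from the exact product, and the deviation grows with `drop_bits`. That is the kind of precision-for-cost trade-off the abstract describes, although a hardware design would realize it inside the partial-product reduction rather than by truncating operands.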
