Hardware-Efficient CNNs: Interleaved Approximate FP32 Multipliers for Kernel Computation

Bindu G Gowda, Yogesh Goyal, Yash Gupta, Madhav Rao

TL;DR

This work addresses the hardware cost of FP32 multipliers in CNN inference by designing compressor-based approximate FP32 mantissa multipliers and evaluating eight configurations of a Radix-8 Modified Booth multiplier. It shows that carefully placed approximate multipliers (AMs) can reduce area and power-delay product (PDP) while maintaining high accuracy, and it introduces a fine-grained interleaving of multiplier types across CNN kernels, optimized with NSGA-II. A two-layer CNN trained on CIFAR-10 demonstrates that interleaving AMs can improve inference accuracy and generalization, with a reported 99.2% of outputs within a 1% error tolerance and substantial PDP gains. The study frames this as a double-approximation approach that jointly optimizes hardware efficiency and CNN performance, offering a practical path to hardware-aware, energy-efficient CNN inference.

Abstract

The single-precision floating-point (FP32) data format, defined by the IEEE 754 standard, is widely employed in scientific computing, signal processing, and deep learning training, where precision is critical. However, FP32 multiplication is computationally expensive and requires complex hardware, especially for exact mantissa multiplication. In practical applications such as neural network inference, perfect accuracy is not always necessary: minor multiplication errors often have little impact on final accuracy. This enables trading precision for gains in area, power, and speed. This work focuses on CNN inference using approximate FP32 multipliers, in which the mantissa multiplication is approximated with error-variant approximate compressors that significantly reduce hardware cost. Furthermore, this work optimizes CNN performance by employing differently approximated FP32 multipliers and studying their impact when interleaved within the kernels across the convolutional layers. The placement and ordering of these approximate multipliers within each kernel are optimized using the Non-dominated Sorting Genetic Algorithm II (NSGA-II), balancing the trade-off between accuracy and hardware efficiency.
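To make the idea of approximating only the mantissa product concrete, the following minimal Python sketch splits two FP32 operands into sign, exponent, and 24-bit mantissa, handles sign and exponent exactly, and perturbs only the mantissa multiplication. It is not the paper's design: the compressor-based Radix-8 Modified Booth datapath is replaced here by a much cruder stand-in (zeroing the low `dropped_bits` bits of each mantissa before an exact integer multiply). The function name `approx_fp32_mul` and the parameter `dropped_bits` are hypothetical, the result is truncated rather than rounded, and special values simply fall back to native multiplication.

```python
import struct


def f32_to_bits(x: float) -> int:
    """Return the IEEE 754 single-precision bit pattern of x."""
    return struct.unpack("<I", struct.pack("<f", x))[0]


def bits_to_f32(b: int) -> float:
    """Interpret a 32-bit pattern as an IEEE 754 single-precision value."""
    return struct.unpack("<f", struct.pack("<I", b))[0]


def approx_fp32_mul(a: float, b: float, dropped_bits: int = 12) -> float:
    """Multiply two FP32 values with an approximated mantissa product.

    Assumption: operand truncation stands in for approximate compressors.
    """
    ba, bb = f32_to_bits(a), f32_to_bits(b)
    sign = (ba ^ bb) >> 31
    ea, eb = (ba >> 23) & 0xFF, (bb >> 23) & 0xFF

    # Zero, subnormal, infinity, and NaN are not modelled in this sketch.
    if ea in (0, 255) or eb in (0, 255):
        return a * b

    # Restore the hidden leading 1 to obtain 24-bit mantissas.
    ma = (ba & 0x7FFFFF) | 0x800000
    mb = (bb & 0x7FFFFF) | 0x800000

    # Approximation step: discard the low-order bits of each mantissa.
    mask = ~((1 << dropped_bits) - 1)
    prod = (ma & mask) * (mb & mask)  # at most 48 bits wide

    # Exponents add exactly; remove one copy of the bias (127).
    exp = ea + eb - 127
    if prod & (1 << 47):  # mantissa product in [2, 4): renormalize
        frac = (prod >> 24) & 0x7FFFFF
        exp += 1
    else:  # mantissa product in [1, 2)
        frac = (prod >> 23) & 0x7FFFFF

    if exp <= 0:  # underflow: flush to signed zero
        return bits_to_f32(sign << 31)
    if exp >= 255:  # overflow: signed infinity
        return bits_to_f32((sign << 31) | (0xFF << 23))
    return bits_to_f32((sign << 31) | (exp << 23) | frac)


if __name__ == "__main__":
    print(approx_fp32_mul(1.5, 2.0, dropped_bits=0))
    print(approx_fp32_mul(3.14159, -2.71828, dropped_bits=12))
```

Because only the mantissa path is altered, the sign and the order of magnitude of the result are always preserved; in this stand-in the relative error is bounded by roughly 2^-(22 - dropped_bits), which illustrates why small mantissa errors tend to be tolerable in inference.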

Paper Structure

This paper contains 7 sections, 1 equation, 5 figures, 2 tables.

Figures (5)

  • Figure 1: Dot-diagram representation of the $24\times24$ approximate Radix-8 Modified Booth multiplier in the PMCSI configuration.
  • Figure 2: PDP and CNN inference accuracy for: (a) a single multiplier across all kernels, and (b) 'K' multiplier types assigned per NSGA-II-optimized sequence across convolutional layers.
  • Figure 3: Representation of the convolution operation within a layer, using multiplier-interleaved kernels.
  • Figure 4: Pareto-optimal solutions resulting from NSGA-II algorithm for three different 'K' values.
  • Figure 5: Illustration of randomly displaced multipliers within the optimal sequence, shown for a simplified case with 9 slots (Seq X) and K = 4 multiplier types.
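The interleaving described in Figures 3 and 5 can be pictured as a per-slot assignment: each of the 9 multiplications in a 3x3 kernel is routed to one of K multiplier types according to a sequence. The Python sketch below illustrates only that routing, under stated assumptions. The multiplier models (`truncating_multiplier`, which keeps a fixed number of mantissa bits per operand), the bit widths, and the example sequence `seq_x` are all hypothetical stand-ins; they are neither the paper's eight Booth configurations nor an NSGA-II result.

```python
import math
from typing import Callable, Sequence

Multiplier = Callable[[float, float], float]


def truncating_multiplier(kept_bits: int) -> Multiplier:
    """Hypothetical multiplier model: keep `kept_bits` mantissa bits per operand."""

    def trunc(x: float) -> float:
        if x == 0.0:
            return 0.0
        m, e = math.frexp(x)  # x == m * 2**e with 0.5 <= |m| < 1
        scale = 1 << kept_bits
        return math.ldexp(math.trunc(m * scale) / scale, e)

    def mul(a: float, b: float) -> float:
        return trunc(a) * trunc(b)

    return mul


def interleaved_dot(
    window: Sequence[float],
    kernel: Sequence[float],
    sequence: Sequence[int],
    multipliers: Sequence[Multiplier],
) -> float:
    """Dot product in which slot i uses multiplier type sequence[i]."""
    if not (len(window) == len(kernel) == len(sequence)):
        raise ValueError("window, kernel, and sequence must have equal length")
    return sum(
        multipliers[t](x, w) for x, w, t in zip(window, kernel, sequence)
    )


# K = 4 hypothetical multiplier types, from most to least precise.
multipliers = [truncating_multiplier(bits) for bits in (24, 16, 12, 8)]

# Hypothetical 9-slot assignment for a flattened 3x3 kernel (not an optimized one).
seq_x = [0, 2, 1, 3, 0, 3, 1, 2, 0]

if __name__ == "__main__":
    window = [0.12, 0.50, 0.33, 0.90, 0.71, 0.05, 0.44, 0.68, 0.27]
    kernel = [0.25, -0.50, 0.125, 0.75, -0.30, 0.60, -0.10, 0.20, 0.40]
    exact = sum(x * w for x, w in zip(window, kernel))
    print(exact, interleaved_dot(window, kernel, seq_x, multipliers))
```

In a search such as NSGA-II, a sequence like `seq_x` would play the role of the candidate being evolved, scored on both accuracy and hardware cost; the sketch leaves that optimization loop out entirely.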