Mixed precision accumulation for neural network inference guided by componentwise forward error analysis

El-Mehdi El Arar, Silviu-Ioan Filip, Theo Mary, Elisa Riccietti

TL;DR

This work aims to reduce the cost of neural network inference by introducing a componentwise forward error analysis that links the error in each output component to the condition numbers of the layer and of the activation function. It then derives a practical mixed-precision accumulation strategy: compute layer outputs in low precision, estimate per-component condition numbers, and selectively recompute the most sensitive components in higher precision to balance accuracy and efficiency. The proposed Algorithm 1, guided by a tunable tolerance τ, demonstrates favorable cost–accuracy tradeoffs on multilayer perceptrons with ReLU and tanh activations, improving significantly on uniform low-precision accumulation and remaining competitive with higher-precision baselines. The study identifies practical limitations such as overflow considerations, sensitivity to the parameter τ, and the overhead of dynamic recomputation, and outlines future work including static precision configurations and extensions to convolutional networks and transformers.

Abstract

This work proposes a mathematically founded mixed precision accumulation strategy for the inference of neural networks. Our strategy is based on a new componentwise forward error analysis that explains the propagation of errors in the forward pass of neural networks. Specifically, our analysis shows that the error in each component of the output of a linear layer is proportional to the condition number of the inner product between the weights and the input, multiplied by the condition number of the activation function. These condition numbers can vary widely from one component to the other, thus creating a significant opportunity to introduce mixed precision: each component should be accumulated in a precision inversely proportional to the product of these condition numbers. We propose a numerical algorithm that exploits this observation: it first computes all components in low precision, uses this output to estimate the condition numbers, and recomputes in higher precision only the components associated with large condition numbers. We test our algorithm on various networks and datasets and confirm experimentally that it can significantly improve the cost–accuracy tradeoff compared with uniform precision accumulation baselines.
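The three steps described in the abstract (low precision pass, condition number estimate, selective recomputation) can be sketched for a single layer as follows. This is a minimal illustration, not the authors' Algorithm 1: the function names `accumulate`, `kappa_tanh` and `mixed_precision_layer` are hypothetical, NumPy's `float16` and `float32` stand in for the low and high accumulation precisions because NumPy has no fp8 type, and the numerator of the condition number is formed in `float64` purely for simplicity.

```python
import numpy as np

def accumulate(W, h, dtype):
    """Return W @ h with every product and partial sum rounded to `dtype`."""
    W, h = W.astype(dtype), h.astype(dtype)
    acc = np.zeros(W.shape[0], dtype=dtype)
    for j in range(W.shape[1]):
        acc = acc + W[:, j] * h[j]  # one rounded multiply and one rounded add per term
    return acc

def kappa_tanh(x):
    """Condition number |phi'(x) x| / |phi(x)| of tanh, using its limit 1 at x = 0."""
    x = np.asarray(x, dtype=np.float64)
    t = np.tanh(x)
    with np.errstate(divide="ignore", invalid="ignore"):
        k = np.abs(x * (1.0 - t**2)) / np.abs(t)
    return np.where(x == 0.0, 1.0, k)

def mixed_precision_layer(W, h_prev, phi, kappa_phi, tau,
                          low=np.float16, high=np.float32):
    """One layer: low precision pass, condition estimate, selective recomputation."""
    # Step 1: accumulate every component in low precision.
    v = accumulate(W, h_prev, low)
    v64 = v.astype(np.float64)
    # Step 2: estimate kappa = kappa_phi(v) * (|W||h|) / |W h| from the low precision v.
    # Assumption: the numerator is computed in float64 here for simplicity.
    num = np.abs(W).astype(np.float64) @ np.abs(h_prev).astype(np.float64)
    with np.errstate(divide="ignore", invalid="ignore"):
        kappa = kappa_phi(v64) * (num / np.abs(v64))
    # Step 3: recompute sensitive components in high precision, then requantize.
    # Written as ~(kappa <= tau) so that NaN estimates are also recomputed.
    recompute = ~(kappa <= tau)
    if recompute.any():
        v[recompute] = accumulate(W[recompute], h_prev, high).astype(low)
    return phi(v), recompute

# Usage on random data (illustrative only, not one of the paper's networks).
rng = np.random.default_rng(0)
W_demo = rng.standard_normal((8, 64))
h_demo = rng.standard_normal(64)
h_out, mask = mixed_precision_layer(W_demo, h_demo, np.tanh, kappa_tanh, tau=100.0)
print(int(mask.sum()), "of", mask.size, "components recomputed in high precision")
```

Raising `tau` keeps more components in low precision and lowers the cost; lowering it recomputes more components and approaches uniform high precision accumulation.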

Paper Structure

This paper contains 17 sections, 4 theorems, 37 equations, 7 figures, 1 table, 1 algorithm.

Key Result

Lemma 2.2

Let $A\in\mathbb{R}^{m\times n}$, $x\in\mathbb{R}^n$, and $\Delta x \in \mathbb{R}^n$. We have

Figures (7)

  • Figure 1.1: Illustration of our inference approach with mixed precision accumulation (Algorithm 1). At each layer $\ell$ we first compute the MMA $v_\ell = W_\ell h_{\ell-1}$ (where $h_{\ell-1}$ is the output of the previous layer) and the activation $h_\ell = \phi_\ell(v_\ell)$ (where $\phi_\ell$ is the activation function) in uniform low precision $u_\mathrm{low}$ (blue). We estimate the condition number $\kappa_\ell$ and use it to decide which components can be kept in low precision (those for which $(\kappa_\ell)_i \le \tau$, for some tolerance $\tau$) and which must be recomputed in higher precision $u_\mathrm{high}$ (red); the latter are then requantized to low precision and recombined with the components kept in low precision to produce the final output of the layer, which is passed to the next layer.
  • Figure 2.1: Condition number $\kappa_\phi(x)=\frac{|\phi'(x)x|}{|\phi(x)|}$ for $\phi(x)=\text{ReLU}(x)$ (left) and $\phi(x)=\tanh(x)$ (right).
  • Figure 3.1: Comparison of the condition numbers $\kappa_\ell=\kappa_\phi(v_\ell)\circ \kappa_{v_\ell}$ depending on whether they are computed in fp32 or in fp8, for a three-layer network trained on the MNIST dataset with ReLU (left) and $\tanh$ (right) activations. The values are sorted with respect to the fp32 condition numbers.
  • Figure 3.2: Distribution of the components of the numerator $|W_\ell||h_{\ell-1}|$ (left) and of the denominator $|W_\ell h_{\ell-1}|$ (right) of $\kappa_{v_\ell}$ computed in fp8, for a three-layer network trained on the MNIST dataset with ReLU (top) and $\tanh$ (bottom) activations.
  • Figure 3.3: Comparison of the condition number $\kappa_\ell=\kappa_\phi(v_\ell)\circ \kappa_{v_\ell}$ and its proposed approximation $\kappa_\ell'=\kappa_\phi(v_\ell)\circ \frac{c}{|W_\ell h_{\ell-1}|}$ (with $c=3$), both computed in fp8, for a three-layer network trained on the MNIST dataset with ReLU (left) and $\tanh$ (right) activations.
  • ...and 2 more figures
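
To make the quantities in these captions concrete, the short worked example below evaluates the activation condition number from Figure 2.1 for both activations, and the inner-product condition number (numerator $|W_\ell||h_{\ell-1}|$ over denominator $|W_\ell h_{\ell-1}|$, as in Figure 3.2) for a single row. The row $w$ and input $h$ are made-up values chosen to show cancellation. For ReLU with $x<0$ the formula reads $0/0$, and this excerpt does not state which convention the paper adopts there.

```latex
% Activation condition number \kappa_\phi(x) = |\phi'(x) x| / |\phi(x)| (Figure 2.1).
\begin{align*}
\kappa_{\mathrm{ReLU}}(x) &= \frac{|1 \cdot x|}{|x|} = 1 \qquad (x > 0), \\
\kappa_{\tanh}(x) &= \frac{|x|\,\bigl(1 - \tanh^2 x\bigr)}{|\tanh x|},
\qquad \lim_{x \to 0} \kappa_{\tanh}(x) = 1,
\qquad \lim_{|x| \to \infty} \kappa_{\tanh}(x) = 0 .
\end{align*}

% Inner-product condition number for one row w of W_\ell (illustrative values).
\[
w = (1,\,-1), \quad h = (1.01,\,1): \qquad
\kappa_v = \frac{|w|^T |h|}{|w^T h|}
         = \frac{1.01 + 1}{|1.01 - 1|}
         = \frac{2.01}{0.01} = 201 .
\]
```

A component like this one, where the inner product is small relative to the size of its terms, is the kind that exceeds the tolerance $\tau$ and is recomputed in higher precision, whereas a saturated $\tanh$ damps the error and allows low precision.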

Theorems & Definitions (9)

  • Lemma 2.2
  • proof
  • Lemma 2.3
  • proof
  • Theorem 2.4
  • Corollary 2.5
  • proof
  • Remark 3.1
  • Remark 3.2