On the numerical reliability of nonsmooth autodiff: a MaxPool case study

Ryan Boustany

TL;DR

Nonsmooth MaxPool Jacobians with lower norms appear to help maintain stable and efficient test accuracy, whereas those with higher norms can result in instability and decreased performance.

Abstract

This paper considers the reliability of automatic differentiation (AD) for neural networks involving the nonsmooth MaxPool operation. We investigate the behavior of AD across different precision levels (16, 32, and 64 bits) and convolutional architectures (LeNet, VGG, and ResNet) on various datasets (MNIST, CIFAR10, SVHN, and ImageNet). Although AD can be incorrect, recent research has shown that it coincides with the derivative almost everywhere, even in the presence of nonsmooth operations (such as MaxPool and ReLU). In practice, however, AD operates on floating-point numbers rather than real numbers, so there is a need to explore the subsets on which AD can be numerically incorrect. These subsets include a bifurcation zone (where AD is incorrect over the reals) and a compensation zone (where AD is incorrect over floating-point numbers but correct over the reals). Training with SGD, we study the impact of different choices of the nonsmooth Jacobian for the MaxPool function at 16-bit and 32-bit precision. Our findings suggest that nonsmooth MaxPool Jacobians with lower norms help maintain stable and efficient test accuracy, whereas those with higher norms can result in instability and decreased performance. We also observe that the influence of MaxPool's nonsmooth Jacobians on learning can be reduced by using batch normalization, using Adam-like optimizers, or increasing the precision level.
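To make the role of ties concrete, here is a minimal Python sketch, offered as an illustration rather than as the paper's implementation. It treats a flattened 2x2 pooling window as a list and compares two ways of choosing a Jacobian for the max function: a one-hot vector on the first maximal entry, and equal weights on all maximal entries. Both choices lie in the convex hull of the one-hot vectors at the maximal indices. The function names, the sample window, and the two selection rules are assumptions made for this example. Rounding to 16 bits uses the half-precision format of the standard `struct` module.

```python
import math
import struct


def to_float16(x):
    """Round a Python float to the nearest IEEE 754 half-precision value."""
    return struct.unpack("e", struct.pack("e", x))[0]


def max_jacobian_first(xs):
    """Hypothetical rule: one-hot vector on the first maximal entry."""
    first = xs.index(max(xs))
    return [1.0 if i == first else 0.0 for i in range(len(xs))]


def max_jacobian_uniform(xs):
    """Hypothetical rule: equal weights on every maximal entry."""
    m = max(xs)
    ties = [i for i, x in enumerate(xs) if x == m]
    return [1.0 / len(ties) if i in ties else 0.0 for i in range(len(xs))]


def l2_norm(v):
    """Euclidean norm of a list of floats."""
    return math.sqrt(sum(c * c for c in v))


# Flattened 2x2 pooling window (illustrative values).
# In 64-bit arithmetic the maximum is unique, so both rules agree.
window = [0.1, 1.0, 1.0001, -0.3]
j_first_64 = max_jacobian_first(window)
j_unif_64 = max_jacobian_uniform(window)

# After rounding to 16 bits, 1.0 and 1.0001 collide and a tie appears,
# so the two rules now return different Jacobians with different norms.
window16 = [to_float16(x) for x in window]
j_first_16 = max_jacobian_first(window16)
j_unif_16 = max_jacobian_uniform(window16)

print(j_first_64, j_unif_64)
print(j_first_16, l2_norm(j_first_16))
print(j_unif_16, l2_norm(j_unif_16))
```

In this toy case, lowering the precision creates a tie that does not exist at higher precision. Once a tie exists, the selection rule determines both the direction and the norm of the vector that backpropagation sends through the pooling window.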

Paper Structure (62 sections, 1 theorem, 23 equations, 14 figures, 5 tables)

Introduction
MaxPool: a nonsmooth operation
Various types of nonsmooth AD errors:
Reals vs floating-point numbers:
Implications for learning dynamics:
Related works and contributions:
Organization of the paper:
MaxPool neural networks and nonsmooth AD
Preliminaries and notations
Nonsmooth AD framework
Network parameters subsets
MaxPool-derived programs
A more general numerical bifurcation zone
A numerical criterion for the bifurcation and compensation zones
Numerical bifurcation zone for $\mathrm{ReLU}$ networks:
...and 47 more sections

Key Result

Proposition 1

Given subsets $\Theta_R$, $\Theta_B$, and $\Theta_C$ in $\mathbb{R}^p$ as defined in Definition 3, the following properties hold:

Figures (14)

Figure 1: Histogram of $\mathrm{backprop}$ variation $D_{m,q}$ for LeNet-5 on MNIST (128 mini-batch size) at 32-bit precision, comparing $P$ with $\tilde{P}$ and $P$ with $Q$ over $M = 1000$ experiments.
Figure 2: Histogram of $\mathrm{backprop}$ variation under nondeterministic GPU operations, where $f$ is a LeNet-5 network on MNIST with batch size 128 for $M =1000$ experiments.
Figure 3: Histogram of $\mathrm{backprop}$ variation with $\mathrm{ReLU}$-derived programs, where $f$ is a LeNet-5 network on MNIST with batch size 128 for $M = 1000$ experiments.
Figure 4: Impact of different size parameters on the proportion of affected mini-batches (see Equation (\ref{eq:proportionbatch})) using the CIFAR10 dataset. First: Different VGG network sizes. Second: VGG11 with varying mini-batch sizes. Third: VGG11 with and without batch normalization.
Figure 5: Training a VGG network on CIFAR10 with SGD. We performed ten random initializations for each experiment, depicted by the boxplots and the filled contours (standard deviation).
...and 9 more figures

Theorems & Definitions (16)

Definition 1: Calculus model, programs and nonsmooth AD
Remark 1
Example 1
Definition 2: Backprop set
Remark 2
Remark 3
Definition 3: Compensation, bifurcation and regular zones
Proposition 1
Remark 4: Backprop returns a gradient a.e.
Definition 4: Clarke Jacobian of matrix’s maximum function
...and 6 more
