On the Variance of Neural Network Training with respect to Test Sets and Distributions

Keller Jordan

TL;DR

It is shown that standard CIFAR-10 and ImageNet trainings have little variance in performance on the underlying test-distributions from which their test-sets are sampled, and it is proved that the variance of neural network trainings on their test-sets is a downstream consequence of the class-calibration property discovered by Jiang et al. (2021).

Abstract

Typical neural network trainings have substantial variance in test-set performance between repeated runs, impeding hyperparameter comparison and training reproducibility. In this work we present the following results towards understanding this variation. (1) Despite having significant variance on their test-sets, we demonstrate that standard CIFAR-10 and ImageNet trainings have little variance in performance on the underlying test-distributions from which their test-sets are sampled. (2) We show that these trainings make approximately independent errors on their test-sets. That is, the event that a trained network makes an error on one particular example does not affect its chances of making errors on other examples, relative to their average rates over repeated runs of training with the same hyperparameters. (3) We prove that the variance of neural network trainings on their test-sets is a downstream consequence of the class-calibration property discovered by Jiang et al. (2021). Our analysis yields a simple formula which accurately predicts variance for the binary classification case. (4) We conduct preliminary studies of data augmentation, learning rate, finetuning instability and distribution-shift through the lens of variance between runs.
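Result (2) can be illustrated with a small simulation. This is a minimal sketch, not the paper's experiment. It assumes each test example has a fixed error rate across repeated runs and that errors on different examples are independent. The per-example error rates `p_err` are drawn from a hypothetical Beta distribution rather than measured from trained networks, and `n_runs` is an arbitrary choice. Under these assumptions, test-set accuracy is a mean of independent Bernoulli variables, so its variance has a closed form that the simulation should reproduce.

```python
import numpy as np

rng = np.random.default_rng(0)

n_examples = 10_000  # size of the CIFAR-10 test-set
n_runs = 1_000       # hypothetical number of repeated training runs

# Hypothetical per-example error rates (an assumption, not measured data):
# most examples are almost never misclassified, a few are frequently missed.
p_err = rng.beta(0.1, 1.5, size=n_examples)

# Closed form under independence: the variance of the mean of independent
# Bernoulli(p_i) variables is sum_i p_i (1 - p_i) / n^2.
predicted_var = np.sum(p_err * (1.0 - p_err)) / n_examples**2

# Simulation: each run errs on example i independently with probability p_err[i].
errors = rng.random((n_runs, n_examples), dtype=np.float32) < p_err
accuracy = 1.0 - errors.mean(axis=1)
simulated_var = accuracy.var(ddof=1)

print(f"predicted std of accuracy: {np.sqrt(predicted_var):.4%}")
print(f"simulated std of accuracy: {np.sqrt(simulated_var):.4%}")
```

With real data, `p_err` would be replaced by the empirical per-example error rates averaged over many runs, and the simulated distribution could then be compared against the observed distribution of test-set accuracy.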

Paper Structure (29 sections, 11 theorems, 26 equations, 14 figures)

Introduction
Related work
Setup
The statistical structure of neural network errors
Do lucky random seeds generalize?
Errors are approximately independent
Distribution-wise variance is small
Variation can be predicted from class-wise calibration
Additional experiments
The effect of finetuning instability
The effect of data augmentation
The effect of learning rate
The effect of distribution shift
Discussion
Training details
...and 14 more sections

Key Result

Theorem 1

In expectation, the variance in test-set accuracy overestimates the variance in true error.
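One way to see why a statement of this kind holds is sketched below. The paper's own proof and notation may differ. The symbols are assumptions introduced here: $e_h(x) \in \{0,1\}$ indicates that network $h$ errs on example $x$, $\bar{e}(x) = \mathbb{E}_h[e_h(x)]$ is the per-example error rate over runs, $\mathrm{err}(h) = \mathbb{E}_x[e_h(x)]$ is the true error, and $\mathrm{err}_S(h)$ is the error on a test-set $S = (x_1, \dots, x_n)$ drawn i.i.d. from the test-distribution.

```latex
\begin{align*}
\mathbb{E}_S\!\left[\mathrm{Var}_h\big(\mathrm{err}_S(h)\big)\right]
  &= \frac{1}{n^2} \sum_{i=1}^{n} \sum_{j=1}^{n}
     \mathbb{E}_S\!\left[\mathrm{Cov}_h\big(e_h(x_i),\, e_h(x_j)\big)\right] \\
  &= \frac{1}{n}\, \mathbb{E}_x\!\left[\bar{e}(x)\big(1 - \bar{e}(x)\big)\right]
     + \frac{n-1}{n}\, \mathrm{Var}_h\big(\mathrm{err}(h)\big) \\
  &\ge \mathrm{Var}_h\big(\mathrm{err}(h)\big).
\end{align*}
% Diagonal terms (i = j): Var_h(e_h(x)) = \bar{e}(x)(1 - \bar{e}(x)) for a 0/1 variable.
% Off-diagonal terms (i \ne j): x_i and x_j are independent draws, so
%   E_{x,x'}[Cov_h(e_h(x), e_h(x'))] = E_h[err(h)^2] - (E_h[err(h)])^2 = Var_h(err(h)).
% Final inequality: by Jensen's inequality,
%   Var_h(E_x[e_h(x)]) \le E_x[Var_h(e_h(x))] = E_x[\bar{e}(x)(1 - \bar{e}(x))].
```

Rearranging the second line gives an expression for $\mathrm{Var}_h(\mathrm{err}(h))$ in terms of quantities that can be estimated from repeated runs on a single test-set, which is the kind of estimate the distribution-wise variance figures rely on.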

Figures (14)

Figure 1: Accuracy distributions. The test-set accuracy distributions across our four training durations, displayed as unsmoothed histograms for 60,000 repeated runs of training each. The differences between the "luckiest" and "unluckiest" runs (max minus min accuracy) are 13.2%, 6.6%, 1.7%, and 1.4% for the 0, 4, 16, and 64-epoch training durations, respectively. The corresponding standard deviations are 1.87%, 0.56%, 0.19%, and 0.15%.
Figure 2: Error rates on disjoint splits of test data become decorrelated when training to convergence. We evaluate a large number of independently trained networks on two splits of the CIFAR-10 test-set. When under-training there is substantial correlation, so that a "lucky" run which over-performs on the first split is also likely to achieve higher-than-average accuracy on the second. As we increase the training duration, the two error rates decorrelate from each other.
Figure 3: Independent errors explain variance when training to convergence. (Left) We compare the empirical distribution of test-set accuracy with that generated by simulating an equal number of samples under the hypothesis of independent errors. The hypothesis is wrong for short trainings, but becomes a close fit as training progresses. (Right) The hypothesis accurately predicts variance when training to convergence.
Figure 4: A pair with independent errors. (Left) Image 776 of the CIFAR-10 test-set. Out of 60,000 independent runs of 64-epoch training, 21,736 networks (36.2%) correctly predict this example. (Right) Image 796, which is correctly predicted by 36,392 networks (60.7%). The number of networks that predict both correctly is 13,103 (21.83%), which does not differ to a statistically significant degree from $0.362 \cdot 0.607 \approx 21.97\%$, the value predicted if their errors are independent.
Figure 5: Test-set variance overestimates distribution-wise variance. We use Equation \ref{eq:estimate_distributionwise} to estimate the distribution-wise variance $\mathrm{Var}_{h \sim \mathcal{H}_{\mathcal{A}}}(\mathrm{err}(h))$. It becomes 20$\times$ smaller than the test-set variance when training to convergence.
...and 9 more figures
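
The independence check described in the Figure 4 caption can be reproduced from the counts it reports. The sketch below is one possible test, not necessarily the one used in the paper. It computes the expected number of jointly correct runs under independence, then the phi coefficient of the 2x2 contingency table, and converts it to a large-sample z-score. The variable names are chosen here for illustration.

```python
import math

# Counts taken from the Figure 4 caption.
n_runs = 60_000   # independent runs of 64-epoch training
n_a = 21_736      # runs that correctly predict image 776
n_b = 36_392      # runs that correctly predict image 796
n_both = 13_103   # runs that correctly predict both images

# Expected number of jointly correct runs if the two events are independent.
expected_both = n_a * n_b / n_runs

# Phi coefficient: the Pearson correlation between the two 0/1 indicators.
phi = (n_runs * n_both - n_a * n_b) / math.sqrt(
    n_a * (n_runs - n_a) * n_b * (n_runs - n_b)
)

# Large-sample approximation: under independence, phi * sqrt(N) is roughly
# standard normal, which gives a two-sided p-value via erfc.
z = phi * math.sqrt(n_runs)
p_value = math.erfc(abs(z) / math.sqrt(2.0))

print(f"observed both correct: {n_both}")
print(f"expected both correct: {expected_both:.1f}")
print(f"phi = {phi:.5f}, z = {z:.2f}, two-sided p = {p_value:.3f}")
```

With these counts the z-score stays inside the conventional range of plus or minus 1.96, which agrees with the caption's statement that the difference is not statistically significant.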

Theorems & Definitions (20)

Definition 1
Theorem 1
Theorem 2
Definition 2
Theorem 3
Theorem 4
Lemma 1
proof
Lemma 2
proof
...and 10 more
