Learning Capacity: A Measure of the Effective Dimensionality of a Model

Daiwei Chen, Wei-Kai Chang, Pratik Chaudhari

TL;DR

The learning capacity gives a quantitative notion of capacity even for non-parametric models such as random forests and nearest neighbor classifiers.

Abstract

We use a formal correspondence between thermodynamics and inference, where the number of samples can be thought of as the inverse temperature, to study a quantity called "learning capacity", which is a measure of the effective dimensionality of a model. We show that the learning capacity is a useful notion of complexity because (a) it correlates well with the test loss and it is a tiny fraction of the number of parameters for many deep networks trained on typical datasets, (b) it depends upon the number of samples used for training, (c) it is numerically consistent with notions of capacity obtained from PAC-Bayes generalization bounds, and (d) the test loss as a function of the learning capacity does not exhibit double descent. We show that the learning capacity saturates at very small and very large sample sizes; the threshold that characterizes the transition between these two regimes provides guidelines as to when one should procure more data and when one should search for a different architecture to improve performance. We show how the learning capacity can be used to provide a quantitative notion of capacity even for non-parametric models such as random forests and nearest neighbor classifiers.
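
To make the thermodynamic analogy concrete, the following is a short worked sketch rather than the paper's own derivation. It assumes that the learning capacity is defined like a heat capacity with the inverse temperature $\beta$ replaced by the number of samples $N$, and it uses a hypothetical energy curve $\overline U(N) = U_0 + K/(2N)$ (the classical asymptotic form for a regular model with $K$ parameters) purely for illustration.

```latex
% Assumption: \beta <-> N, and the capacity is the heat-capacity analog.
\begin{align*}
  C_{\text{thermo}} &= \frac{\partial U}{\partial T}
     = -\beta^2 \frac{\partial U}{\partial \beta}, \qquad T = 1/\beta, \\
  \overline C(N) &= -N^2 \frac{\partial \overline U}{\partial N}
     \quad \text{(replacing } \beta \text{ by } N\text{)}, \\
  \overline U(N) &= U_0 + \frac{K}{2N}
     \;\Longrightarrow\;
     \overline C(N) = -N^2 \left( -\frac{K}{2N^2} \right) = \frac{K}{2}.
\end{align*}
```

Under these assumptions the toy model has a constant capacity of $K/2$. The abstract's claims that the capacity depends on $N$ and is a tiny fraction of the number of parameters then correspond to measured energy curves that deviate from this idealized form.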

Paper Structure (37 sections, 45 equations, 9 figures, 1 table)

Figures (9)

  • Figure 1: (a,c) show the average energy ($\overline U(N)$) for a one-hidden-layer fully-connected network with 100 neurons and LeNet on MNIST (binary classification), and for ALLCNN and Wide-ResNet on CIFAR-10 (10-class classification). (b,d) show the learning capacity ($\overline C(N)$), estimated by fitting a seventh-order polynomial to $\overline U$ under the constraints that $\overline U$ decreases monotonically and $\overline C$ increases monotonically.
  • Figure 2: Average energy $\overline U(N)$ and learning capacity $\overline C(N)$ for different architectures, datasets and input modalities. Table 1 shows the rightmost point on these plots. The average energy decreases monotonically as a function of the number of samples $N$ in the training dataset. The learning capacity, which indicates the number of constrained degrees of freedom in a model, increases with the number of training samples $N$. Architectures with a smaller learning capacity have a smaller test loss. Different architectures trained on the same dataset have different learning capacities. The learning capacity exhibits freezing at both very small and very large $N$ for many of these models. For synthetic data experiments in (c), the learning capacity of models with the same architecture is smaller if the task is easier (large values of $\kappa$).
  • Figure 3: Learning capacity $\overline C$ (solid lines) is numerically consistent with the effective dimensionality obtained from a PAC-Bayes bound in \ref{eq:p_pac_bayes} (dotted lines). The Kendall rank correlation between the learning capacity and the PAC-Bayes effective dimensionality is 0.99 ($p = 4 \times 10^{-9}$) for the fully connected network and 0.2 ($p = 0.37$) for LeNet.
  • Figure 4: (a) Left: double descent phenomenon for the test loss as a function of the number of weights for one and two-layer fully-connected networks with different numbers of hidden neurons (10--100) trained on MNIST with sample size $N$ = 50,000. (a) Right: double descent phenomenon for ResNet18 with different widths (1--64) trained on CIFAR-10 with sample size $N$ = 50,000 and with noise rate 0.2. (b): when plotted against the learning capacity, the test loss does not exhibit double descent. Colors indicate different values of $N$ (blue: 200, orange: 1,000, green: 2,000, red: 4,000, purple: 10,000, and brown: 50,000). The slope of all regressions is non-zero and positive ($p <$ 0.005), except for the brown curve, for which the $p$-value is not significant.
  • Figure 5: Average energy $\overline U(N)$ and learning capacity $\overline C(N)$ for three tabular learning problems using random forests and $k$-nearest neighbor classifiers. In both cases, the Miniboone dataset has the best test loss and both models exhibit a lower learning capacity for this dataset (for the $k$-NN there is a large uncertainty in the estimate in this case). As the random forest learns the Numerai dataset, its capacity (which is reflective of the number of constrained degrees of freedom) increases sharply; in contrast the $k$-NN does not have a good test loss on this dataset and its capacity is much smaller.
  • ...and 4 more figures
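
The captions above describe estimating $\overline C(N)$ by fitting a polynomial to $\overline U(N)$. The following is a minimal sketch of that idea, not the authors' procedure: it assumes the capacity is $\overline C(N) = -N^2 \, \partial \overline U / \partial N$ (equivalently $\partial \overline U / \partial T$ with $T = 1/N$), fits an unconstrained low-order polynomial in $T$ instead of the constrained seventh-order fit mentioned in Figure 1, and uses a synthetic energy curve. The function name `learning_capacity` and the toy data are hypothetical.

```python
import numpy as np
from numpy.polynomial import Polynomial


def learning_capacity(n_samples, avg_energy, degree=3):
    """Estimate C(N) from samples of the average energy U(N).

    Assumption: C(N) = -N^2 dU/dN. With T = 1/N the chain rule gives
    C = dU/dT, so we fit U as a polynomial in T and differentiate it.
    No monotonicity constraints are imposed (a simplification).
    """
    n = np.asarray(n_samples, dtype=float)
    u = np.asarray(avg_energy, dtype=float)
    t = 1.0 / n  # "temperature" in the thermodynamic analogy
    fit = Polynomial.fit(t, u, degree)  # least-squares polynomial fit in T
    return fit.deriv()(t)  # dU/dT evaluated at each sample size


# Synthetic energy curve (hypothetical): U(N) = U_0 + K / (2N).
n_grid = np.logspace(2, 4, 20)
k_params = 20
u_toy = 0.1 + k_params / (2.0 * n_grid)

c_est = learning_capacity(n_grid, u_toy)
print(c_est.round(3))
```

For this toy curve the estimate should be close to $K/2 = 10$ at every $N$. On measured energy curves an unconstrained fit can produce a non-monotonic $\overline C$, which is presumably why the caption for Figure 1 mentions monotonicity constraints.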

Theorems & Definitions (1)

  • Remark 1