Deep Learning is Not So Mysterious or Different

Andrew Gordon Wilson

TL;DR

The paper argues that the generalization phenomena associated with deep learning are neither inherently mysterious nor unique to neural networks. Using soft inductive biases (flexible hypothesis spaces with a preference for simpler solutions) together with long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds, the author shows how benign overfitting, overparametrization, and double descent can be understood and bounded. The paper also acknowledges ways in which deep learning is relatively distinctive, such as representation learning, universal learning, and mode connectivity, while emphasizing that these do not undermine the explanatory power of classical theory. It advocates bridging communities so that well-established theory is applied to modern models, and suggests empirical evaluation of bounds as a practical diagnostic tool.

Abstract

Deep neural networks are often seen as different from other model classes by defying conventional notions of generalization. Popular examples of anomalous generalization behaviour include benign overfitting, double descent, and the success of overparametrization. We argue that these phenomena are not distinct to neural networks, or particularly mysterious. Moreover, this generalization behaviour can be intuitively understood, and rigorously characterized, using long-standing generalization frameworks such as PAC-Bayes and countable hypothesis bounds. We present soft inductive biases as a key unifying principle in explaining these phenomena: rather than restricting the hypothesis space to avoid overfitting, embrace a flexible hypothesis space, with a soft preference for simpler solutions that are consistent with the data. This principle can be encoded in many model classes, and thus deep learning is not as mysterious or different from other model classes as it might seem. However, we also highlight how deep learning is relatively distinct in other ways, such as its ability for representation learning, phenomena such as mode connectivity, and its relative universality.

Paper Structure

This paper contains 26 sections, 2 theorems, 13 equations, 7 figures, 1 table.

Key Result

Theorem 3.1

Consider a bounded risk $R(h,x) \in [a,a+\Delta]$, and a countable hypothesis space $h\in \mathcal{H}$ for which we have a prior $P(h)$. Let the empirical risk $\hat{R}(h) = \frac{1}{n}\sum_{i=1}^n R(h,x_i)$ be a sum over independent random variables $R(h,x_i)$ for a fixed hypothesis $h$. Let $R(h)
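
For reference, a countable hypothesis bound of this kind is usually stated as follows. This is the standard form obtained from Hoeffding's inequality and a union bound weighted by the prior, given here as a sketch rather than as the paper's exact wording. It assumes that $R(h) = \mathbb{E}[\hat{R}(h)]$ denotes the expected risk and that $\delta \in (0,1)$ is a confidence parameter; neither symbol is defined in the text above. With probability at least $1-\delta$, simultaneously for all $h \in \mathcal{H}$:

$$
R(h) \;\le\; \hat{R}(h) + \Delta \sqrt{\frac{\log \frac{1}{P(h)} + \log \frac{1}{\delta}}{2n}}
$$

The reasoning is short. For a fixed $h$, Hoeffding's inequality gives $\Pr\big[R(h) - \hat{R}(h) > t\big] \le \exp(-2nt^2/\Delta^2)$. Allotting each hypothesis a failure probability of $P(h)\,\delta$ and solving for $t$ yields the square-root term, and summing over the countable space gives a total failure probability of at most $\sum_{h} P(h)\,\delta \le \delta$.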

Figures (7)

  • Figure 1: Generalization phenomena associated with deep learning can be reproduced with simple linear models and understood. Top: Benign Overfitting. A 150th-order polynomial with order-dependent regularization reasonably describes (a) simple and (b) complex structured data, while also being able to perfectly fit (c) pure noise. (d) A Gaussian process exactly reproduces the CIFAR-10 results in zhang2016understanding, perfectly fitting noisy labels, but still achieving reasonable generalization. Moreover, for both the GP and (e) ResNet, the marginal likelihood, directly corresponding to PAC-Bayes bounds germain2016pac, decreases with more altered labels, as in wilson2020bayesian. Bottom: Double Descent. Both the (f) ResNet and (g) linear random feature model display double descent, with effective dimensionality closely tracking the second descent in the low training loss regime as in maddox2020rethinking.
  • Figure 2: Generalization phenomena can be formally characterized by generalization bounds. Generalization can be upper bounded by the empirical risk and compressibility of a hypothesis $h$, as in the section on PAC-Bayes bounds. The compressibility, formalized in terms of Kolmogorov complexity $K(h)$, can be further upper bounded by a model's filesize. Large models fit the data well, and can be effectively compressed to small filesizes. Unlike Rademacher complexity, these bounds do not penalize a model for having a hypothesis space $\mathcal{H}$ that can fit noise, and describe benign overfitting, double descent, and overparametrization. They can even provide non-vacuous bounds on LLMs, as in lotfi2023llm above.
  • Figure 3: Soft inductive biases enable flexible hypothesis spaces without overfitting. Many generalization phenomena can be understood through the notion of soft inductive biases: rather than restricting the solutions a model can represent, specify a preference for certain solutions over others. In this conceptualization, we enlarge the hypothesis space with hypotheses that have lower preference in lighter blue, rather than restricting them entirely. There are many ways to implement soft inductive biases. Rather than use a low order polynomial, use a high order polynomial with order-dependent regularization. Alternatively, rather than restrict a model to translation equivariance (e.g., ConvNet), have a preference for invariances through a compression bias (e.g., a transformer, or RPP with ConvNet bias). Overparametrization is yet another way to implement a soft bias.
  • Figure 4: Achieving good generalization with soft inductive biases. Left: A large hypothesis space, but no preference amongst solutions that provide the same fit to the data. Therefore, training will often lead to overfit solutions that generalize poorly. Middle: Soft inductive biases guide training towards good generalization by representing a flexible hypothesis space in combination with preferences between solutions, represented by different shades. Right: Restricting the hypothesis space can help prevent overfitting by only considering solutions that have certain desirable properties. However, by limiting expressiveness, the model cannot capture the nuances of reality, hindering generalization.
  • Figure 5: Flexibility with a simplicity bias can be appropriate for varying data sizes and complexities. We use 2nd-order, 15th-order, and regularized 15th-order polynomials to fit three regression problems with varying training data sizes, generated from the functions described in (a)-(c). We use a special regularization penalty that increases with the order of the polynomial coefficient. We show the average performance $\pm$ 1 standard deviation over 100 fits of 100 test samples. By increasing complexity only as needed to fit the data, the regularized 15th-order polynomial is as good as or better than all other models for all data sizes and problems of varying complexity.
  • ...and 2 more figures
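
The high-order polynomial with order-dependent regularization mentioned in the captions for Figures 1, 3, and 5 can be illustrated with a short sketch. The captions do not give the exact penalty, so the form used here is an assumption: coefficient $j$ is penalized by `strength * growth**j` with `growth > 1`, which makes higher-order terms progressively more expensive. The function names, default values, and toy data are likewise hypothetical.

```python
import numpy as np


def fit_regularized_polynomial(x, y, order=15, strength=1e-3, growth=2.0):
    """Fit a polynomial by penalized least squares.

    Assumed penalty (not taken from the paper): the squared coefficient of
    x**j is weighted by strength * growth**j, so the hypothesis space stays
    flexible while low-order (simpler) solutions are softly preferred.
    """
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    # Columns are x**0, x**1, ..., x**order.
    features = np.vander(x, order + 1, increasing=True)
    penalties = strength * growth ** np.arange(order + 1)
    penalties[0] = 0.0  # leave the intercept unpenalized
    # Closed-form solution of the penalized normal equations.
    gram = features.T @ features + np.diag(penalties)
    return np.linalg.solve(gram, features.T @ y)


def predict(coeffs, x):
    """Evaluate the fitted polynomial at the points x."""
    x = np.asarray(x, dtype=float)
    return np.vander(x, len(coeffs), increasing=True) @ coeffs


if __name__ == "__main__":
    # Toy data: a simple linear trend plus noise, on [-1, 1] for conditioning.
    rng = np.random.default_rng(0)
    x_train = np.linspace(-1.0, 1.0, 30)
    y_train = 1.0 + 2.0 * x_train + 0.1 * rng.standard_normal(x_train.shape)
    coeffs = fit_regularized_polynomial(x_train, y_train)
    print("largest high-order coefficient:", np.abs(coeffs[5:]).max())
```

When the data are simple, the penalty drives the high-order coefficients toward zero, yet the same model can still use those terms when the data demand it. That is the sense in which the bias is soft rather than a hard restriction to low order.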

Theorems & Definitions (3)

  • Theorem 3.1: Countable Hypothesis Bound
  • Theorem 3.1
  • Proof: lotfi2023llm
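
A bound of the countable-hypothesis type can be evaluated numerically once a prior is chosen. The sketch below assumes the standard form $R(h) \le \hat{R}(h) + \Delta\sqrt{(\log\frac{1}{P(h)} + \log\frac{1}{\delta})/(2n)}$ and a prior $P(h) = 2^{-\ell(h)}$, where $\ell(h)$ is the length in bits of a prefix-free encoding of the model, such as a compressed file. Kraft's inequality guarantees that such a prior sums to at most one. The function name, arguments, and example numbers are hypothetical and do not come from the paper.

```python
import math


def countable_hypothesis_bound(empirical_risk, n, code_length_bits,
                               delta=0.05, risk_range=1.0):
    """Upper bound on expected risk, holding with probability >= 1 - delta.

    Assumes the prior P(h) = 2 ** (-code_length_bits), so that
    log(1 / P(h)) = code_length_bits * log(2). risk_range plays the role
    of Delta, the width of the interval containing the risk.
    """
    if n <= 0:
        raise ValueError("n must be a positive number of samples")
    if not 0.0 < delta < 1.0:
        raise ValueError("delta must lie strictly between 0 and 1")
    if code_length_bits < 0:
        raise ValueError("code_length_bits must be non-negative")
    log_inverse_prior = code_length_bits * math.log(2.0)
    complexity = log_inverse_prior + math.log(1.0 / delta)
    return empirical_risk + risk_range * math.sqrt(complexity / (2.0 * n))


if __name__ == "__main__":
    # Hypothetical numbers: 10% training error, one million samples,
    # and a model compressed to 100 kilobytes.
    bound = countable_hypothesis_bound(
        empirical_risk=0.10, n=1_000_000, code_length_bits=8 * 100_000)
    print(f"risk bound: {bound:.3f}")
```

The bound depends on the hypothesis only through its training risk and its code length, not on whether the hypothesis space could fit noise. This is why a large but compressible model can still receive a non-vacuous guarantee.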