The loss landscape of overparameterized neural networks

Y Cooper

TL;DR

The paper analyzes the loss landscape of overparameterized neural networks and shows that, for $n>d$, the set of global minima $M=L^{-1}(0)$ is generically an $(n-d)$-dimensional submanifold of $\mathbb{R}^n$ rather than a discrete set. Using Sard's theorem and a construction on the functions $f_i$ defining the loss, it shows that $M$ is either empty or a smooth submanifold of codimension $d$, and that the Hessian at a minimum has $d$ positive directions and $n-d$ flat directions. In the feedforward case with last hidden layer width $h>d$ and rectified/smooth activations, $M$ is nonempty and has the same $(n-d)$-dimensional structure; nonemptiness is supported by explicit interpolation results for data memorization. These results are consistent with Hessian spectra observed in practice and give a theoretical explanation for why overparameterized networks have many equivalent minimizers.
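The Hessian count quoted above can be checked with a short calculation. The sketch below assumes a squared-error loss $L(w)=\sum_{i=1}^{d} f_i(w)^2$ with residuals $f_i(w)=f_w(x_i)-y_i$; this form and the symbols $J$, $x_i$, $y_i$ are assumptions made here for illustration, since the summary does not reproduce the paper's setup.

$$
\nabla^2 L(w) \;=\; 2\sum_{i=1}^{d}\Big(\nabla f_i(w)\,\nabla f_i(w)^{\top} + f_i(w)\,\nabla^2 f_i(w)\Big).
$$

At a point $p\in M$ every residual vanishes, $f_i(p)=0$, so the second term drops out:

$$
\nabla^2 L(p) \;=\; 2\,J(p)^{\top}J(p), \qquad J(p)=\begin{pmatrix}\nabla f_1(p)^{\top}\\ \vdots\\ \nabla f_d(p)^{\top}\end{pmatrix}\in\mathbb{R}^{d\times n}.
$$

If the gradients $\nabla f_1(p),\dots,\nabla f_d(p)$ are linearly independent, then $J(p)$ has rank $d$, so $\nabla^2 L(p)$ is positive semidefinite with exactly $d$ positive eigenvalues and an $(n-d)$-dimensional kernel. Under the same independence condition that kernel, $\ker J(p)$, is the tangent space to $M$ at $p$.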

Abstract

We explore some mathematical features of the loss landscape of overparameterized neural networks. A priori one might imagine that the loss function looks like a typical function from $\mathbb{R}^n$ to $\mathbb{R}$: in particular, nonconvex, with discrete global minima. In this paper, we prove that in at least one important way, the loss function of an overparameterized neural network does not look like a typical function. If a neural net has $n$ parameters and is trained on $d$ data points, with $n>d$, we show that the locus $M$ of global minima of $L$ is usually not discrete, but rather an $(n-d)$-dimensional submanifold of $\mathbb{R}^n$. In practice, neural nets commonly have orders of magnitude more parameters than data points, so this observation implies that $M$ is typically a very high-dimensional subset of $\mathbb{R}^n$.

Paper Structure

This paper contains 11 sections, 5 theorems, 28 equations.

Key Result

Theorem 2.1

In the setting described above, the set $M = L^{-1}(0)$ is generically (that is, possibly after an arbitrarily small change to the data set) a smooth $(n-d)$-dimensional submanifold (possibly empty) of $\mathbb{R}^n$.
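As a numerical sanity check of the dimension count in Theorem 2.1, the following minimal sketch builds a toy network and inspects the Hessian at a zero-loss point. Everything in it is an assumption made for illustration and is not taken from the paper: the one-hidden-layer tanh architecture, the squared-error loss, the helper names `model` and `jacobian`, and the sizes $h=4$, $d=3$, $n=12$. The labels are generated from a chosen parameter vector `w_star`, so `w_star` lies in $M$ by construction.

```python
import numpy as np


def model(w, xs, h):
    """Toy network f_w(x) = sum_k c_k * tanh(a_k * x + b_k), scalar in and out.

    The parameter vector w packs [a, b, c], each of length h (hypothetical layout).
    """
    a, b, c = w[:h], w[h:2 * h], w[2 * h:]
    return np.tanh(np.outer(xs, a) + b) @ c  # shape (d,)


def jacobian(w, xs, h):
    """Analytic Jacobian of the d network outputs w.r.t. the n = 3h parameters."""
    a, b, c = w[:h], w[h:2 * h], w[2 * h:]
    t = np.tanh(np.outer(xs, a) + b)   # hidden activations, shape (d, h)
    s = 1.0 - t ** 2                   # derivative of tanh at the pre-activations
    d_a = s * c * xs[:, None]          # d f / d a_k
    d_b = s * c                        # d f / d b_k
    d_c = t                            # d f / d c_k
    return np.hstack([d_a, d_b, d_c])  # shape (d, n)


rng = np.random.default_rng(0)
h = 4                                  # hidden width (assumed)
n = 3 * h                              # number of parameters
xs = np.array([-1.0, 0.5, 2.0])        # d = 3 training inputs (assumed)
d = xs.size

w_star = rng.normal(size=n)            # a parameter vector we declare to be a minimum
ys = model(w_star, xs, h)              # labels chosen so that the loss at w_star is zero

loss = np.sum((model(w_star, xs, h) - ys) ** 2)

# For a sum-of-squares loss, the Hessian at a zero-residual point is 2 * J^T J.
J = jacobian(w_star, xs, h)
hessian = 2.0 * J.T @ J

eigs = np.linalg.eigvalsh(hessian)
tol = 1e-10 * eigs.max()               # relative threshold separating "positive" from "flat"
num_positive = int(np.sum(eigs > tol))
num_flat = n - num_positive

print("loss at w_star:", loss)
print("positive Hessian directions:", num_positive, "(d =", d, ")")
print("flat Hessian directions:", num_flat, "(n - d =", n - d, ")")
```

For generic parameters the Jacobian has full row rank $d$, so one would expect $d$ positive and $n-d$ numerically zero eigenvalues, matching the local picture of an $(n-d)$-dimensional manifold of minima. The check is local and numerical only; it does not replace the Sard-type argument behind the theorem.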

Theorems & Definitions (14)

  • Theorem 2.1
  • Remark 2.2
  • proof
  • Proposition 2.3
  • proof
  • Definition 3.1
  • Definition 3.2
  • Lemma 3.3
  • proof
  • Corollary 3.4
  • ...and 4 more