Mildly Overparameterized ReLU Networks Have a Favorable Loss Landscape

Kedar Karhadkar, Michael Murray, Hanna Tseran, Guido Montúfar

TL;DR

This paper analyzes the loss landscapes of mildly overparameterized ReLU networks on finite datasets under squared loss, showing that most activation regions contain no bad local minima and often host high-dimensional sets of global minima. It develops Jacobian-rank-based arguments and combinatorial region counting to characterize when differentiable critical points are global optima, and provides explicit results for shallow two-layer networks as well as one-dimensional inputs. The results extend to deep networks under mild width conditions and include volume-based bounds via anticoncentration arguments, with experimental evidence showing phase transitions in the prevalence of full-rank Jacobians. Collectively, the work suggests that realistic levels of overparameterization yield substantially benign optimization landscapes, independent of initialization or data distribution, though open questions remain for intermediate widths and deeper architectures.

Abstract

We study the loss landscape of both shallow and deep, mildly overparameterized ReLU neural networks on a generic finite input dataset for the squared error loss. We show both by count and volume that most activation patterns correspond to parameter regions with no bad local minima. Furthermore, for one-dimensional input data, we show most activation regions realizable by the network contain a high-dimensional set of global minima and no bad local minima. We experimentally confirm these results by finding a phase transition from most regions having a full-rank Jacobian to many regions having a rank-deficient Jacobian, depending on the amount of overparameterization.
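
The kind of experiment described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' code: it assumes a two-layer network $f(x)=\sum_j v_j\,\mathrm{ReLU}(w_j\cdot x)$ with Gaussian data and weights, fixed output weights $v_j\in\{-1,1\}$, and a Jacobian taken with respect to the first-layer weights only. The function names `jacobian_first_layer` and `full_rank_frequency` are hypothetical.

```python
import numpy as np

def jacobian_first_layer(W, v, X):
    """Jacobian of the n network outputs w.r.t. the first-layer weights W.

    Assumed network: f(x) = sum_j v[j] * relu(W[j] @ x); X has shape (n, d0).
    Returns an array of shape (n, d1 * d0).
    """
    A = (X @ W.T > 0).astype(float)          # (n, d1) activation pattern
    # Row i, block j equals v[j] * A[i, j] * X[i].
    J = (A * v)[:, :, None] * X[:, None, :]  # (n, d1, d0)
    return J.reshape(X.shape[0], -1)

def full_rank_frequency(n, d0, d1, trials=200, seed=0):
    """Fraction of random draws for which the Jacobian has rank n."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, d0))      # assumption: Gaussian data
        W = rng.standard_normal((d1, d0))     # assumption: Gaussian init
        v = rng.choice([-1.0, 1.0], size=d1)  # assumption: fixed signs
        J = jacobian_first_layer(W, v, X)
        hits += np.linalg.matrix_rank(J) == n
    return hits / trials

if __name__ == "__main__":
    # Sweep the width d1 for fixed n and d0 to look for a transition.
    for d1 in (1, 2, 4, 8, 16):
        print(d1, full_rank_frequency(n=10, d0=2, d1=d1))
```

When $d_0 d_1 < n$ the Jacobian has fewer columns than rows, so rank $n$ is impossible and the frequency is exactly zero; the interesting regime is how quickly the frequency rises once $d_0 d_1 \geq n$.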

Paper Structure (25 sections, 25 theorems, 170 equations, 9 figures)

Key Result

Lemma 1

Let $\mathcal{V}$ be a proper algebraic subset of $\mathbb{R}^n$. Then $\mathcal{V}$ has Lebesgue measure $0$.
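
As a small worked illustration of Lemma 1 (the examples are standard facts and are not taken from the paper), recall that an algebraic subset is the common zero set of a family of polynomials, and it is proper when it is not all of $\mathbb{R}^n$. How the lemma enters the paper's rank arguments is not shown in this summary, so the last remark is only a plausible reading.

```latex
% Example 1: the unit circle is a proper algebraic subset of R^2.
\mathcal{V}_1 = \{(x,y) \in \mathbb{R}^2 : x^2 + y^2 - 1 = 0\},
\qquad \lambda_2(\mathcal{V}_1) = 0 .

% Example 2: singular matrices form a proper algebraic subset of R^{n x n},
% since det is a polynomial in the entries and det(I) = 1 \neq 0.
\mathcal{V}_2 = \{M \in \mathbb{R}^{n \times n} : \det M = 0\},
\qquad \lambda_{n^2}(\mathcal{V}_2) = 0 .

% Consequence: if M is drawn from a distribution with a density,
\Pr[\det M = 0] = 0 .
```

Statements of the form "a matrix built from generic data has full rank" typically follow this pattern: rank deficiency is expressed as the vanishing of polynomial minors, and one exhibits a single point where a minor is nonzero to show the zero set is proper.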

Figures (9)

  • Figure 1: The probability of the Jacobian being of full rank from a random initialization for various values of $d_1$ and $n$, where the input dimension $d_0$ is left fixed.
  • Figure 2: The probability of the Jacobian being of full rank for various values of $d_1$ and $n$, where the input dimension $d_0$ scales linearly in the number of samples $n$.
  • Figure 3: Illustration of Proposition 8 (Identity of non-empty regions). The polytope $P$ for a ReLU on three data points $x^{(1)}, x^{(2)}, x^{(3)}$ is the Minkowski sum of the line segments $P_i=\operatorname{conv}\{0,x^{(i)}\}$ highlighted in red. The activation regions in parameter space are the normal cones of $P$ at its different vertices. Hence the vertices correspond to the non-empty activation regions. These are naturally labeled by vectors $\mathbf{1}_S$ that indicate which $x^{(i)}$ are added to produce the vertex and record the activation patterns.
  • Figure 4: Function space of a ReLU on $n$ data points in $\mathbb{R}$, for $n=3,4$. The function space is a polyhedral cone in the non-negative orthant of $\mathbb{R}^n$. We can represent this, up to scaling by non-negative factors, by functions $f=(f_1,\ldots, f_n)$ with $f_1+\cdots+f_n=1$. These form a polyline, shown in red, inside the $(n-1)$-simplex. The sum of $m$ ReLUs corresponds to non-negative multiples of convex combinations of any $m$ points in the polyline, and arbitrary linear combinations of $m$ ReLUs correspond to arbitrary scalar multiples of affine combinations of any $m$ points in this polyline.
  • Figure 5: Subdivision of the parameter space of a single ReLU on two data points $x^{(1)}, x^{(2)}$ in $1\times \mathbb{R}^1$ by values of the Jacobian (left) and corresponding pieces of the function space in $\mathbb{R}^2$ (right). The activation regions are intersections of half-spaces with activation patterns indicating the positive ones or, equivalently, the indices of data points where the unit is active.
  • ...and 4 more figures
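
The activation regions described in the captions for Figures 3 and 5 can be explored numerically. The sketch below is an illustration under stated assumptions rather than the paper's procedure: it takes a single ReLU with pre-activation $b + w x$ on one-dimensional inputs embedded as $(1, x)$, samples Gaussian parameters $(b, w)$, and records which data points are active. The function name `sampled_activation_patterns` and the data points are hypothetical.

```python
import numpy as np

def sampled_activation_patterns(xs, samples=5000, seed=0):
    """Collect activation patterns of one ReLU (b + w * x) on data xs.

    Patterns are found by sampling (b, w) at random, so regions of very
    small angular width could be missed for closely spaced data.
    """
    rng = np.random.default_rng(seed)
    X = np.stack([np.ones_like(xs), xs], axis=1)  # (n, 2), rows (1, x_i)
    theta = rng.standard_normal((samples, 2))     # rows (b, w)
    active = theta @ X.T > 0                      # (samples, n)
    return {tuple(int(a) for a in row) for row in active}

# Hypothetical data: three distinct points on the real line.
xs = np.array([-1.0, 0.0, 1.0])
patterns = sampled_activation_patterns(xs)
print(len(patterns), sorted(patterns))
```

For these three points the $2^3 = 8$ conceivable patterns reduce to 6 non-empty regions: each data point contributes one line through the origin in the $(b, w)$ plane, and $n$ distinct lines through the origin cut the plane into $2n$ cones. The patterns $(0,1,0)$ and $(1,0,1)$ never occur, because an affine function of $x$ cannot be positive only at the middle point or only at the two outer points.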

Theorems & Definitions (48)

  • Lemma 1
  • Lemma 2
  • proof
  • Theorem 3
  • Lemma 4
  • Theorem 5
  • Proposition 6: Number of non-empty regions
  • Corollary 7
  • Proposition 8: Identity of non-empty regions
  • Lemma 9
  • ...and 38 more