Mildly Overparameterized ReLU Networks Have a Favorable Loss Landscape

Kedar Karhadkar; Michael Murray; Hanna Tseran; Guido Montúfar

TL;DR

This paper analyzes the loss landscapes of mildly overparameterized ReLU networks on finite datasets under the squared error loss. It shows that most activation regions contain no bad local minima and, for one-dimensional input data, that most regions also contain a high-dimensional set of global minima. The analysis uses Jacobian-rank arguments and combinatorial region counting to characterize when differentiable critical points are global minima, with explicit results for shallow two-layer networks and for one-dimensional inputs. The results extend to deep networks under mild width conditions and include volume-based bounds obtained via anticoncentration arguments. Experiments show a phase transition in the prevalence of full-rank Jacobians as the amount of overparameterization varies. Collectively, the work suggests that realistic levels of overparameterization yield largely benign optimization landscapes, independent of initialization or data distribution, though open questions remain for intermediate widths and deeper architectures.

Abstract

We study the loss landscape of both shallow and deep, mildly overparameterized ReLU neural networks on a generic finite input dataset for the squared error loss. We show both by count and volume that most activation patterns correspond to parameter regions with no bad local minima. Furthermore, for one-dimensional input data, we show that most activation regions realizable by the network contain a high-dimensional set of global minima and no bad local minima. We experimentally confirm these results by finding a phase transition, depending on the amount of overparameterization, from most regions having a full-rank Jacobian to many regions having a rank-deficient Jacobian.
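The kind of experiment described in the abstract can be illustrated with a minimal sketch. It assumes a two-layer network $f(x) = v^\top \operatorname{ReLU}(Wx)$ without biases, Gaussian data and Gaussian initialization, and a Jacobian taken with respect to all parameters $(W, v)$. These choices, and the names `jacobian` and `full_rank_probability`, are illustrative assumptions rather than the authors' exact setup.

```python
import numpy as np

def jacobian(W, v, X):
    """Jacobian of the n network outputs on X w.r.t. (W, v) for f(x) = v . relu(W x)."""
    pre = X @ W.T                                  # (n, d1) pre-activations
    act = (pre > 0).astype(float)                  # (n, d1) activation pattern
    # d f(x_i) / d W[j, k] = v[j] * act[i, j] * X[i, k]
    JW = (act * v)[:, :, None] * X[:, None, :]     # (n, d1, d0)
    # d f(x_i) / d v[j] = relu(w_j . x_i)
    Jv = np.maximum(pre, 0.0)                      # (n, d1)
    return np.concatenate([JW.reshape(len(X), -1), Jv], axis=1)

def full_rank_probability(n, d0, d1, trials=200, seed=0):
    """Monte Carlo estimate of P(rank J = n) under Gaussian data and parameters (an assumption)."""
    rng = np.random.default_rng(seed)
    hits = 0
    for _ in range(trials):
        X = rng.standard_normal((n, d0))
        W = rng.standard_normal((d1, d0))
        v = rng.standard_normal(d1)
        hits += int(np.linalg.matrix_rank(jacobian(W, v, X)) == n)
    return hits / trials

# Sweep the hidden width d1 for a fixed number of samples n and input dimension d0.
for d1 in (1, 2, 4, 8, 16):
    print(d1, full_rank_probability(n=10, d0=2, d1=d1, trials=50))
```

When the number of parameters $d_1 d_0 + d_1$ is smaller than $n$, the Jacobian cannot have rank $n$, so the estimate is exactly zero. The interesting regime is how quickly it approaches one as $d_1$ grows.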

Paper Structure (25 sections, 25 theorems, 170 equations, 9 figures)

Introduction
Contributions
Relation to prior works
Preliminaries
Shallow ReLU networks: counting activation regions with bad local minima
Shallow univariate ReLU networks: activation regions with global vs local minima
Nonsmooth critical points
Extension to deep networks
Volumes of activation regions
One-dimensional input data
Arbitrary dimension input data
Experiments
Conclusion
Reproducibility statement
Details on counting activation regions with no bad local minima
...and 10 more sections

Key Result

Lemma 1

Let $\mathcal{V}$ be a proper algebraic subset of $\mathbb{R}^n$. Then $\mathcal{V}$ has Lebesgue measure $0$.
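A standard illustration of Lemma 1 (not taken from the paper) is the set of singular matrices. Identifying $\mathbb{R}^{m\times m}$ with $\mathbb{R}^n$ for $n=m^2$, the determinant is a polynomial in the $n$ entries, and it is not identically zero because $\det I = 1$:

$$
\mathcal{V} = \{A \in \mathbb{R}^{m\times m} : \det A = 0\}, \qquad I \notin \mathcal{V} \;\Longrightarrow\; \mathcal{V} \subsetneq \mathbb{R}^{m\times m} \;\Longrightarrow\; \operatorname{Leb}(\mathcal{V}) = 0 .
$$

Consequently, a random matrix whose distribution has a density with respect to Lebesgue measure, for example one with i.i.d. Gaussian entries, is invertible with probability one. Arguments of this type are a common way to show that a rank condition holds for generic parameters or data.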

Figures (9)

Figure 1: The probability of the Jacobian being of full rank from a random initialization for various values of $d_1$ and $n$, where the input dimension $d_0$ is left fixed.
Figure 2: The probability of the Jacobian being of full rank for various values of $d_1$ and $n$, where the input dimension $d_0$ scales linearly in the number of samples $n$.
Figure 3: Illustration of Proposition 8 (Identity of non-empty regions). The polytope $P$ for a ReLU on three data points $x^{(1)}, x^{(2)}, x^{(3)}$ is the Minkowski sum of the line segments $P_i=\operatorname{conv}\{0,x^{(i)}\}$ highlighted in red. The activation regions in parameter space are the normal cones of $P$ at its different vertices. Hence the vertices correspond to the non-empty activation regions. These are naturally labeled by vectors $\mathbf{1}_S$ that indicate which $x^{(i)}$ are added to produce the vertex and record the activation patterns.
Figure 4: Function space of a ReLU on $n$ data points in $\mathbb{R}$, for $n=3,4$. The function space is a polyhedral cone in the non-negative orthant of $\mathbb{R}^n$. We can represent this, up to scaling by non-negative factors, by functions $f=(f_1,\ldots, f_n)$ with $f_1+\cdots+f_n=1$. These form a polyline, shown in red, inside the $(n-1)$-simplex. The sum of $m$ ReLUs corresponds to non-negative multiples of convex combinations of any $m$ points in the polyline, and arbitrary linear combinations of $m$ ReLUs correspond to arbitrary scalar multiples of affine combinations of any $m$ points in this polyline.
Figure 5: Subdivision of the parameter space of a single ReLU on two data points $x^{(1)}, x^{(2)}$ in $1\times \mathbb{R}^1$ by values of the Jacobian (left) and corresponding pieces of the function space in $\mathbb{R}^2$ (right). The activation regions are intersections of half-spaces with activation patterns indicating the positive ones or, equivalently, the indices of data points where the unit is active.
...and 4 more figures
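The correspondence stated in the Figure 3 caption, between non-empty activation regions of a single ReLU and vertices of the Minkowski sum $P$, can be checked numerically in a small case. The sketch below assumes a bias-free unit $w \mapsto \mathbf{1}[w \cdot x^{(i)} > 0]$ on points in $\mathbb{R}^2$ in general position. The three points and the function names are hypothetical and are not the ones used in the figure.

```python
import itertools
import numpy as np

def activation_patterns_2d(X):
    """Map each non-empty activation pattern of w -> (w . x_i > 0) to a witness w.

    Assumes the rows of X are nonzero points in R^2, no two of which lie on
    a common line through the origin.
    """
    X = np.asarray(X, dtype=float)
    angles = np.arctan2(X[:, 1], X[:, 0])
    # The pattern changes only when w crosses a line {w : w . x_i = 0};
    # each such line contributes two boundary directions on the unit circle.
    bounds = np.sort(np.concatenate([angles + np.pi / 2, angles - np.pi / 2]) % (2 * np.pi))
    upper = np.append(bounds[1:], bounds[0] + 2 * np.pi)
    witnesses = {}
    for t in (bounds + upper) / 2:           # one direction inside each open sector
        w = np.array([np.cos(t), np.sin(t)])
        pattern = tuple(int(a) for a in (X @ w > 0))
        witnesses[pattern] = w
    return witnesses

def maximizes_over_P(X, pattern, w):
    """Check that sum_{i in S} x_i maximizes p -> w . p over all subset sums of X.

    P is the convex hull of the subset sums, so a linear functional attains
    its maximum over P at one of them.
    """
    X = np.asarray(X, dtype=float)
    candidates = [np.array(s) @ X for s in itertools.product([0, 1], repeat=len(X))]
    vertex = np.array(pattern) @ X
    return all(w @ vertex >= w @ p - 1e-12 for p in candidates)

X = [[1.0, 0.0], [1.0, 1.0], [0.0, 1.0]]     # hypothetical data points
regions = activation_patterns_2d(X)
print(sorted(regions))                        # six of the eight possible patterns
print(all(maximizes_over_P(X, s, w) for s, w in regions.items()))
```

For these three points only six of the $2^3$ patterns are realized: $(1,0,1)$ and $(0,1,0)$ do not occur. Each realized pattern $\mathbf{1}_S$ has a witness $w$ for which $\sum_{i\in S} x^{(i)}$ maximizes $w \cdot p$ over $P$, which is the normal-cone relationship described in the caption.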

Theorems & Definitions (48)

Lemma 1
Lemma 2
proof
Theorem 3
Lemma 4
Theorem 5
Proposition 6: Number of non-empty regions
Corollary 7
Proposition 8: Identity of non-empty regions
Lemma 9
...and 38 more
