Hidden Minima in Two-Layer ReLU Networks

Yossi Arjevani

TL;DR

The paper analyzes the optimization landscape of two-layer ReLU networks under squared loss, revealing two infinite families of spurious minima (type I hidden minima and type II) whose Hessian spectra agree up to O(d^{-1/2}). It develops a framework based on tangency sets, o-minimal definability, and group representation theory to classify critical-point arcs by symmetry (isotropy) and to capture their Puiseux-series structure in 1/d. Key contributions include precise descriptions of tangency-arc structure for type I and II minima, explicit isotypic decompositions of the parameter space, and leading-term Hessian analyses that distinguish hidden from detectable minima via symmetry-breaking arcs. Numerical investigations validate the theoretical predictions by tracing tangency arcs and bounding distances to nearby critical points, illustrating the practical relevance of symmetry-based arguments for understanding nonconvex optimization in neural networks.

Abstract

We consider the optimization problem arising from fitting two-layer ReLU networks with $d$ inputs under the square loss, where labels are generated by a target network. Two infinite families of spurious minima have recently been identified: one whose loss vanishes as $d \to \infty$, and another whose loss remains bounded away from zero. The latter are nevertheless avoided by vanilla SGD, and thus hidden, motivating the search for analytic properties distinguishing the two types. Perhaps surprisingly, the Hessian spectra of hidden and non-hidden minima agree up to terms of order $O(d^{-1/2})$, providing limited explanatory power. Consequently, our analysis of hidden minima proceeds instead via curves along which the loss is minimized or maximized. The main result is that arcs emanating from hidden minima differ, characteristically, by their structure and symmetry, precisely on account of the $O(d^{-1/2})$-eigenvalue terms absent from previous analyses.

Paper Structure

This paper contains 20 sections, 6 theorems, 20 equations, and 2 tables.

Key Result

Corollary 1

If $f$ is definable, then there exists a tangency arc $\gamma$ parameterized by arc length satisfying $\mathcal{L}(\gamma(r)) = m(r)$, and similarly for $M(r)$.

Theorems & Definitions (10)

  • Remark 1
  • Definition 1
  • Corollary 1
  • Lemma 1
  • Theorem 1
  • Corollary 2
  • Definition 2
  • Theorem 2
  • Definition 3
  • Lemma 2: arjevanifield2020hessian