In almost all shallow analytic neural network optimization landscapes, efficient minimizers have strongly convex neighborhoods

Felix Benning, Steffen Dereich

TL;DR

This work analyzes the optimization landscapes of shallow analytic neural networks for regression, showing that, after partitioning parameters into an efficient domain where all hidden units are utilized and a redundant, lower-dimensional domain, the mean-squared-error objective is almost surely a Morse function on the efficient domain. The authors develop a rigorous framework combining a decomposition of the objective, Gaussian-process analysis of the target function, and an analytic construction that ties Hessian degeneracy to a thin set, proving Morse-ness on polynomially efficient parameters and establishing domain-equivalence results for common activations such as sigmoid and tanh. They characterize redundancies, prove polynomial-slicing results to extend univariate independence to the multivariate case, and prove that efficient and redundant critical points exist with positive probability, including constructive pruning arguments that preserve or remove redundancies while retaining criticality. The results imply that, in practice, SGD-like optimization on the efficient domain encounters well-behaved, nondegenerate local minima with fast local convergence, while redundant parameterizations can introduce non-isolated critical points and complicate the landscape. Overall, the paper provides a detailed probabilistic and geometric account of when shallow networks exhibit Morse landscapes, with implications for the design and analysis of training dynamics and architecture pruning strategies.
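The link between nondegenerate (Morse) critical points and strongly convex neighborhoods can be sketched in a few lines. This is a standard argument written in illustrative notation, not the paper's own: $J$ denotes a twice continuously differentiable cost, $\theta^*$ a local minimum, and $\mu$, $r$ are symbols introduced here.

```latex
\begin{align*}
&\nabla J(\theta^*) = 0, \quad \nabla^2 J(\theta^*) \succeq 0
  && \text{($\theta^*$ is a local minimum)} \\
&\det \nabla^2 J(\theta^*) \neq 0
  && \text{($J$ is Morse, so $\theta^*$ is nondegenerate)} \\
&\Longrightarrow\ \mu := \lambda_{\min}\bigl(\nabla^2 J(\theta^*)\bigr) > 0 \\
&\Longrightarrow\ \exists\, r > 0:\ \lambda_{\min}\bigl(\nabla^2 J(\theta)\bigr) \ge \tfrac{\mu}{2}
  \ \text{ for all } \|\theta - \theta^*\| < r
  && \text{(continuity of the Hessian)}
\end{align*}
```

Under these assumptions, $J$ is $\tfrac{\mu}{2}$-strongly convex on the ball of radius $r$ around $\theta^*$, which is the property that governs the local convergence rate.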

Abstract

Whether or not a local minimum of a cost function has a strongly convex neighborhood greatly influences the asymptotic convergence rate of optimizers. In this article, we rigorously analyze the prevalence of this property for the mean squared error induced by shallow, 1-hidden layer neural networks with analytic activation functions when applied to regression problems. The parameter space is divided into two domains: the 'efficient domain' (all parameters for which the respective realization function cannot be generated by a network having a smaller number of neurons) and the 'redundant domain' (the remaining parameters). In almost all regression problems on the efficient domain the optimization landscape only features local minima that are strongly convex. Formally, we will show that for certain randomly picked regression problems the optimization landscape is almost surely a Morse function on the efficient domain. The redundant domain has significantly smaller dimension than the efficient domain and on this domain, potential local minima are never isolated.
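The last sentence of the abstract can be illustrated with a toy example. The sketch below assumes a one-dimensional input, a `tanh` activation, a hand-picked data set, and hypothetical names (`realization`, `mse`, `cost_spread`); none of these come from the paper. It switches off one hidden neuron by setting its outer weight to zero, which places the parameter in the redundant domain because a one-neuron network realizes the same function. The cost then does not change when that neuron's inner weight and bias are moved, so a minimizer of this form cannot be isolated.

```python
import math

def realization(w, b, c, c0, x):
    """Shallow network x -> c0 + sum_j c[j] * tanh(w[j] * x + b[j])."""
    return c0 + sum(cj * math.tanh(wj * x + bj) for wj, bj, cj in zip(w, b, c))

def mse(w, b, c, c0, data):
    """Empirical mean squared error over (x, y) pairs."""
    return sum((realization(w, b, c, c0, x) - y) ** 2 for x, y in data) / len(data)

# Toy regression data (an assumption for illustration): y = sin(x) on a grid.
data = [(k / 4.0, math.sin(k / 4.0)) for k in range(-8, 9)]

# Different inner weight / bias choices for the second hidden neuron.
inner_choices = [(-2.0, 0.5), (0.4, 0.3), (0.7, -1.0), (3.0, 2.0)]

def cost_spread(c2):
    """Max minus min cost as the second neuron's inner parameters vary."""
    costs = [
        mse([1.3, w2], [-0.2, b2], [0.8, c2], 0.1, data)
        for w2, b2 in inner_choices
    ]
    return max(costs) - min(costs)

# Outer weight 0: the second neuron is unused, so the cost is flat in (w2, b2).
redundant_spread = cost_spread(0.0)
# Outer weight 0.5: the second neuron contributes, so the cost depends on (w2, b2).
active_spread = cost_spread(0.5)

print("spread with unused neuron:", redundant_spread)
print("spread with active neuron:", active_spread)
```

The flat direction only shows why isolation fails for this particular kind of redundancy; it says nothing about the Hessian on the efficient domain, which is the subject of the paper's main theorem.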

Paper Structure

This paper contains 16 sections, 27 theorems, 181 equations.

Key Result

Theorem 1.7

Let $(\mathfrak{N}, \mathfrak R, R, \mathbf{M})$ be a standard setting (see the definition of the standard model). Assume that the activation function satisfies $\psi\in \{\operatorname{sigmoid}, \tanh\}$ and that the support of $\mathbb{P}_X$ contains a non-empty open set. Then, almost surely, the cost function restricted to the efficient domain is a Morse function. Equivalently, it holds that

Theorems & Definitions (69)

  • Definition 1.1: Shallow neural network
  • Definition 1.2
  • Definition 1.3
  • Definition 1.4: Cost function
  • Definition 1.5: Family of $L^p$-integrable regression problems, target function
  • Definition 1.6
  • Theorem 1.7: Almost all optimization landscapes are Morse on the efficient domain
  • Remark 1.8: Generalization of the Gaussian assumption
  • Remark 1.9: Weak universality
  • Definition 2.1: Polynomial efficiency
  • ...and 59 more