In almost all shallow analytic neural network optimization landscapes, efficient minimizers have strongly convex neighborhoods
Felix Benning, Steffen Dereich
TL;DR
This work analyzes the optimization landscape of shallow analytic neural networks for regression. The parameter space is partitioned into an efficient domain, where all hidden units are utilized, and a redundant, lower-dimensional domain; on the efficient domain, the mean-squared-error objective is almost surely a Morse function. The authors develop a rigorous framework that combines a decomposition of the objective, a Gaussian-process analysis of the target function, and an analytic construction that ties Hessian degeneracy to a thin set. With it they prove the Morse property on polynomially efficient parameters and establish domain-equivalence results for common activations such as sigmoid and tanh. They also characterize redundancies, prove polynomial-slicing results that extend univariate independence to the multivariate case, and show that efficient and redundant critical points exist with positive probability, including constructive pruning arguments that preserve or remove redundancies while retaining criticality. The results suggest that SGD-like optimization on the efficient domain encounters nondegenerate local minima with fast local convergence, whereas redundant parameterizations can introduce non-isolated critical points and complicate the landscape. Overall, the paper gives a detailed probabilistic and geometric account of when shallow networks exhibit Morse landscapes, with implications for the analysis of training dynamics and for architecture pruning strategies.
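To make the notion of redundancy concrete, the following minimal sketch uses a two-neuron tanh network whose second outer weight is zero. This "dead neuron" is one simple kind of redundancy, chosen for illustration; it is not the paper's construction. The function names (`realization`, `mse`), the synthetic data, and the parameter values are all assumptions. The sketch checks that the loss does not change when the dead neuron's inner parameters move, and that pruning the neuron leaves the loss unchanged.

```python
import numpy as np

def realization(params, x):
    """Shallow network x -> sum_j v[j] * tanh(w[j] * x + b[j]) + c (1-D input)."""
    w, b, v, c = params
    hidden = np.tanh(np.outer(x, w) + b)  # shape (n_samples, n_neurons)
    return hidden @ v + c

def mse(params, x, y):
    """Empirical mean squared error of the network on the sample (x, y)."""
    return float(np.mean((realization(params, x) - y) ** 2))

# Synthetic regression data (hypothetical, for illustration only).
rng = np.random.default_rng(0)
x = rng.uniform(-1.0, 1.0, size=200)
y = np.sin(3.0 * x)

# Two hidden neurons; the second outer weight is zero, so that neuron is unused.
w = np.array([1.5, -0.7])
b = np.array([0.2, 0.4])
v = np.array([0.8, 0.0])
c = 0.1
base = mse((w, b, v, c), x, y)

# Moving the dead neuron's inner weight and bias leaves the loss unchanged,
# so no point of this form can be an isolated minimum.
shifted_losses = [
    mse((np.array([1.5, w2]), np.array([0.2, b2]), v, c), x, y)
    for w2, b2 in [(-3.0, 1.0), (0.0, 0.0), (5.0, -2.0)]
]

# Pruning the dead neuron gives a one-neuron network with the same realization.
small = mse((w[:1], b[:1], v[:1], c), x, y)

print(base, shifted_losses, small)
```

Because the loss is constant along these directions, the Hessian at such a point has zero eigenvalues, which is incompatible with a strongly convex neighborhood.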
Abstract
Whether or not a local minimum of a cost function has a strongly convex neighborhood greatly influences the asymptotic convergence rate of optimizers. In this article, we rigorously analyze the prevalence of this property for the mean squared error induced by shallow neural networks (one hidden layer) with analytic activation functions when applied to regression problems. The parameter space is divided into two domains: the 'efficient domain' (all parameters for which the respective realization function cannot be generated by a network having a smaller number of neurons) and the 'redundant domain' (the remaining parameters). In almost all regression problems, the optimization landscape on the efficient domain features only local minima that have strongly convex neighborhoods. Formally, we will show that for certain randomly picked regression problems the optimization landscape is almost surely a Morse function on the efficient domain. The redundant domain has significantly smaller dimension than the efficient domain and, on this domain, potential local minima are never isolated.
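The link between the Morse property and strongly convex neighborhoods can be spelled out in a short worked derivation. The notation is an assumption and is not taken from the source: $J$ denotes the mean-squared-error objective, $\theta$ the parameter vector, $U$ the efficient domain, and $J$ is assumed twice continuously differentiable.

```latex
% Assumed notation: J = MSE objective, \theta = parameters, U = efficient domain.
% Morse property: every critical point in U has a nondegenerate Hessian.
\[
  J \text{ is Morse on } U
  \iff
  \bigl( \theta \in U,\ \nabla J(\theta) = 0
         \implies \det \nabla^2 J(\theta) \neq 0 \bigr).
\]
% At a local minimum \theta^* the Hessian is positive semidefinite;
% nondegeneracy then makes it positive definite:
\[
  \nabla^2 J(\theta^*) \succeq \mu I,
  \qquad
  \mu := \lambda_{\min}\bigl(\nabla^2 J(\theta^*)\bigr) > 0.
\]
% By continuity of the Hessian there is a radius r > 0 such that
\[
  \nabla^2 J(\theta) \succeq \tfrac{\mu}{2} I
  \quad \text{for all } \theta \text{ with } \|\theta - \theta^*\| < r,
\]
% i.e. J is (\mu/2)-strongly convex on the ball of radius r around \theta^*.
```

In particular, a nondegenerate local minimum is isolated, which is the opposite of the behavior described for the redundant domain.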
