On the Quality of the Initial Basin in Overspecified Neural Networks

Itay Safran, Ohad Shamir

TL;DR

This work investigates the geometric landscape of training objectives for ReLU neural networks, focusing on how overspecification alters optimization tractability from random starts. It establishes that, for networks of any depth, a monotone-descent path to a global minimum can be constructed under mild convexity and data conditions, and that random initializations tend to satisfy the required high-loss starting point as width grows. For two-layer nets, the authors introduce a basin-based decomposition showing that overspecification substantially reduces the likelihood of poor local minima, and demonstrate that data structure (low intrinsic dimension or clustering) further amplifies these favorable properties. Overall, the results provide a geometric explanation for why larger networks can be easier to train in practice and connect to prior empirical observations about optimization behavior in deep learning.

Abstract

Deep learning, in the form of artificial neural networks, has achieved remarkable practical success in recent years, for a variety of difficult machine learning applications. However, a theoretical explanation for this remains a major open problem, since training neural networks involves optimizing a highly non-convex objective function, and is known to be computationally hard in the worst case. In this work, we study the geometric structure of the associated non-convex objective function, in the context of ReLU networks and starting from a random initialization of the network parameters. We identify some conditions under which it becomes more favorable to optimization, in the sense of (i) high probability of initializing at a point from which there is a monotonically decreasing path to a global minimum; and (ii) high probability of initializing at a basin (suitably defined) with a small minimal objective value. A common theme in our results is that such properties are more likely to hold for larger ("overspecified") networks, which accords with some recent empirical and theoretical observations.

Paper Structure

This paper contains 25 sections, 17 theorems, 115 equations, 3 figures.

Key Result

Theorem 1

Suppose $L:\mathbb{R}^{m\times k}\rightarrow\mathbb{R}$ is convex. Given a fully-connected network of any depth, with initialization point $\mathcal{W}^{(0)}$, suppose there exists a continuous path $\mathcal{W}^{(\lambda)}$, $\lambda\in[0,1]$, in the space of parameter vectors [...] Then there exists a continuous path $\tilde{\mathcal{W}}^{(\lambda)}$, $\lambda\in[0,1]$ [...]
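
The convexity assumption in Theorem 1 concerns $L$ as a function of the network's outputs, not of its parameters. The sketch below illustrates the distinction on a toy case that is not taken from the paper: a single ReLU neuron with one scalar weight, one sample, and squared loss (so $m=k=1$). The function names, the sample $(x,y)=(1,1)$, and the test points are all illustrative assumptions. It checks the midpoint inequality $f\left(\tfrac{a+b}{2}\right)\le\tfrac{1}{2}\left(f(a)+f(b)\right)$, which every convex function must satisfy.

```python
def relu(z):
    """ReLU activation for a scalar input."""
    return max(0.0, z)


def loss_on_output(p, y=1.0):
    """Squared loss as a function of the prediction p (convex in p)."""
    return (p - y) ** 2


def objective(w, x=1.0, y=1.0):
    """The same loss viewed as a function of the weight w of one ReLU neuron."""
    return loss_on_output(relu(w * x), y)


def violates_midpoint_convexity(f, a, b):
    """Return True if f(midpoint) exceeds the average of f(a) and f(b)."""
    mid = 0.5 * (a + b)
    return f(mid) > 0.5 * (f(a) + f(b))


# Hypothetical test points chosen for illustration.
a, b = -1.0, 1.0

# In output space the squared loss satisfies the midpoint inequality here.
print(violates_midpoint_convexity(loss_on_output, a, b))

# In parameter space the inequality fails: the objective is flat for w <= 0
# (the neuron is inactive), so the midpoint w = 0 is as bad as w = -1.
print(violates_midpoint_convexity(objective, a, b))
```

A single violated midpoint is enough to show that the composed objective is not convex in the weight, even though the loss is convex in the output. This is the kind of non-convexity the paper's geometric analysis has to work around.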

Figures (3)

  • Figure 1: The partition of $\mathbb{R}^2$ into regions by the instances $\mathbf{c}_1=\left( 1,1 \right),\mathbf{c}_2=\left( -2,0.5 \right)$, and the corresponding partition by clustered instances with centers $\mathbf{c}_1,\mathbf{c}_2$. The noisy regions are depicted by the light blue and light red.
  • Figure 2: Plot of $L_S\left( w \right)$ for $\epsilon=0.1$.
  • Figure 3: Plot of $L_S\left( w \right)$ after extending the sample to 2 dimensions. The surface contains one optimal minimum, one bad minimum, and two average-valued minima.

Theorems & Definitions (38)

  • Definition 1
  • Theorem 1
  • Proposition 1
  • Proposition 2
  • Definition 2
  • Lemma 1
  • Lemma 2
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • ...and 28 more