Deep Learning without Poor Local Minima

Kenji Kawaguchi

TL;DR

The paper proves a conjecture published in 1989: for the squared loss of deep linear networks of any depth and any widths, the loss surface is non-convex and non-concave, yet every local minimum is global and every critical point that is not a global minimum is a saddle. It also shows a depth-dependent pathology: "bad" saddles, where the Hessian has no negative eigenvalue, exist for networks with more than three layers but not for three-layer networks. For deep nonlinear networks, the same statements are proved via a reduction to the linear case under an independence assumption adopted from recent work. The results locate the theoretical difficulty of directly training a deep model: harder than classical machine learning models because of non-convexity, but not too hard because there are no poor local minima. The existence of bad saddle points suggests an open problem, and a gap between theory and practice remains.

Abstract

In this paper, we prove a conjecture published in 1989 and also partially address an open problem announced at the Conference on Learning Theory (COLT) 2015. With no unrealistic assumption, we first prove the following statements for the squared loss function of deep linear neural networks with any depth and any widths: 1) the function is non-convex and non-concave, 2) every local minimum is a global minimum, 3) every critical point that is not a global minimum is a saddle point, and 4) there exist "bad" saddle points (where the Hessian has no negative eigenvalue) for the deeper networks (with more than three layers), whereas there is no bad saddle point for the shallow networks (with three layers). Moreover, for deep nonlinear neural networks, we prove the same four statements via a reduction to a deep linear model under the independence assumption adopted from recent work. As a result, we present an instance, for which we can answer the following question: how difficult is it to directly train a deep model in theory? It is more difficult than the classical machine learning models (because of the non-convexity), but not too difficult (because of the nonexistence of poor local minima). Furthermore, the mathematically proven existence of bad saddle points for deeper models would suggest a possible open problem. We note that even though we have advanced the theoretical foundations of deep learning and non-convex optimization, there is still a gap between theory and practice.
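Statements 2 to 4 can be illustrated on the smallest possible case. The following is a minimal sketch, not the paper's construction: it assumes a scalar deep linear network whose output is the product of its weights, a single training pair with input 1 and target 1, and a loss scaled by 0.5. The names `loss` and `hessian` are hypothetical helpers. Two weights correspond to a three-layer network (one hidden layer) and three weights to a four-layer network.

```python
import numpy as np

def loss(w):
    """Squared loss 0.5 * (prod(w) - 1)^2 of a scalar deep linear network
    (assumed toy setup: input 1, target 1)."""
    return 0.5 * (np.prod(w) - 1.0) ** 2

def hessian(f, w, eps=1e-3):
    """Central finite-difference approximation of the Hessian of f at w."""
    n = len(w)
    H = np.zeros((n, n))
    for i in range(n):
        for j in range(n):
            e_i = np.zeros(n)
            e_i[i] = eps
            e_j = np.zeros(n)
            e_j[j] = eps
            H[i, j] = (f(w + e_i + e_j) - f(w + e_i - e_j)
                       - f(w - e_i + e_j) + f(w - e_i - e_j)) / (4 * eps**2)
    return H

# The origin is a critical point in both cases: each partial derivative
# contains a product of the other weights, which vanishes at zero.

# Three layers (two weights): the Hessian at the origin has a negative
# eigenvalue, so the saddle is not "bad".
eig_shallow = np.linalg.eigvalsh(hessian(loss, np.zeros(2)))

# Four layers (three weights): the Hessian at the origin has no negative
# eigenvalue, yet the origin is not a local minimum.
eig_deep = np.linalg.eigvalsh(hessian(loss, np.zeros(3)))
loss_at_origin = loss(np.zeros(3))
loss_nearby = loss(np.full(3, 0.1))  # small step along the diagonal

print("three-layer Hessian eigenvalues at origin:", eig_shallow)
print("four-layer Hessian eigenvalues at origin:", eig_deep)
print("four-layer loss at origin vs nearby:", loss_at_origin, loss_nearby)
```

In the four-layer case second-order information alone cannot reveal a descent direction at the origin, which is what makes such a saddle "bad" in the abstract's sense.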

Paper Structure

This paper contains 34 sections, 11 theorems, 20 equations.

Key Result

Proposition 2.1

(Baldi & Hornik, 1989: shallow linear network) Assume that $H=1$ (i.e., $\overline Y(W,X)=W_2W_1X$), assume that $XX^T$ and $XY^T$ are invertible, assume that $\Sigma$ has $d_y$ distinct eigenvalues, and assume that $p<d_x$, $p<d_y$ and $d_y=d_x$ (e.g., an autoencoder). Then, the loss function ...
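The hypotheses of this proposition can be checked numerically for a given data set. The sketch below is illustrative only and rests on several assumptions not stated in the excerpt above: $X$ and $Y$ store one sample per column, $\Sigma$ is taken to be $YX^T(XX^T)^{-1}XY^T$, the loss is $\frac{1}{2}\|W_2W_1X-Y\|_F^2$, and the data are random Gaussian. The names `shallow_linear_loss` and `check_assumptions` are hypothetical.

```python
import numpy as np

rng = np.random.default_rng(0)
d_x = d_y = 4   # autoencoder-style dimensions, d_y = d_x
p = 2           # hidden width, p < d_x and p < d_y
m = 50          # number of samples (arbitrary choice for this sketch)

X = rng.standard_normal((d_x, m))  # one sample per column (assumption)
Y = rng.standard_normal((d_y, m))

def shallow_linear_loss(W2, W1, X, Y):
    """0.5 * ||W2 W1 X - Y||_F^2; the 0.5 scaling is an assumption."""
    return 0.5 * np.linalg.norm(W2 @ W1 @ X - Y, "fro") ** 2

def check_assumptions(X, Y, p, tol=1e-8):
    """Return a dict of booleans, one per hypothesis of the proposition."""
    d_x, d_y = X.shape[0], Y.shape[0]
    XXt = X @ X.T
    XYt = X @ Y.T
    # Sigma as assumed in this sketch: Y X^T (X X^T)^{-1} X Y^T
    Sigma = Y @ X.T @ np.linalg.solve(XXt, XYt)
    eig = np.sort(np.linalg.eigvalsh((Sigma + Sigma.T) / 2))
    return {
        "XXt_invertible": bool(np.linalg.matrix_rank(XXt) == d_x),
        "XYt_invertible": bool(d_x == d_y
                               and np.linalg.matrix_rank(XYt) == d_x),
        "distinct_eigenvalues": bool(np.all(np.diff(eig) > tol)),
        "widths_ok": bool(p < d_x and p < d_y and d_y == d_x),
    }

checks = check_assumptions(X, Y, p)
print(checks)

# Loss of the all-zero network, which predicts 0 for every sample.
W1 = np.zeros((p, d_x))
W2 = np.zeros((d_y, p))
print(shallow_linear_loss(W2, W1, X, Y))
```

Rank and eigenvalue-gap tests with a tolerance stand in for exact invertibility and distinctness, which floating-point arithmetic cannot certify.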

Theorems & Definitions (12)

  • Proposition 2.1
  • Conjecture 2.2
  • Theorem 2.3
  • Corollary 2.4
  • Proposition 3.1
  • Corollary 3.2
  • Lemma 4.1
  • Lemma 4.2
  • Lemma 4.3
  • Lemma 4.4
  • ...and 2 more