On the loss landscape of a class of deep neural networks with no bad local valleys

Quynh Nguyen, Mahesh Chandra Mukkamala, Matthias Hein

TL;DR

By introducing a class of deep networks with skip-connections to the output and analytic activations, the authors prove that the empirical cross-entropy loss $\Phi(U,V)$ has no bad local valleys and that there exist uncountably many zero-training-error solutions. They show that from any initialization there exists a continuous path along which $\Phi$ is non-increasing and can be driven arbitrarily close to zero, implying no suboptimal strict minima and no local maxima for the considered losses. Empirically, SGD with these skip-output networks generalizes well on MNIST and CIFAR-10, while a random-feature baseline that fixes $\Psi(U)$ and optimizes only $V$ overfits, illustrating SGD's implicit regularization. Overall, the work provides a practical framework to study implicit regularization in deep nets and positions skip-output architectures as useful benchmarks for loss-landscape analyses.

Abstract

We identify a class of over-parameterized deep neural networks with standard activation functions and cross-entropy loss which provably have no bad local valley, in the sense that from any point in parameter space there exists a continuous path on which the cross-entropy loss is non-increasing and gets arbitrarily close to zero. This implies that these networks have no sub-optimal strict local minima.

Paper Structure

This paper contains 21 sections, 6 theorems, 17 equations, 4 figures, 5 tables.

Key Result

Lemma 2.1

If $\Phi(U,V)<\frac{\log(2)}{N}$, then the training error is zero.
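
The reasoning behind this lemma can be checked numerically. The sketch below assumes that $\Phi$ is the cross-entropy averaged over the $N$ training samples (the normalization is not stated in this summary). Under that assumption, $\Phi < \log(2)/N$ forces every per-sample loss below $\log 2$, so the softmax probability of the true class exceeds $1/2$ and the true class is the argmax. The helper names `cross_entropy` and `training_error` are illustrative and do not come from the paper.

```python
import numpy as np

def cross_entropy(logits, labels):
    """Average cross-entropy over N samples (assumed definition of Phi)."""
    # Subtract the row-wise max for numerical stability before exponentiating.
    shifted = logits - logits.max(axis=1, keepdims=True)
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=1, keepdims=True))
    return float(-log_probs[np.arange(len(labels)), labels].mean())

def training_error(logits, labels):
    """Fraction of samples whose argmax prediction differs from the label."""
    return float(np.mean(logits.argmax(axis=1) != labels))

rng = np.random.default_rng(0)
N, m = 50, 10                      # hypothetical sample and class counts
labels = rng.integers(0, m, size=N)

# Confident, correct logits: a large score on the true class only.
confident = 20.0 * np.eye(m)[labels]
phi = cross_entropy(confident, labels)
err = training_error(confident, labels)

# The implication "Phi < log(2)/N  =>  zero error" should hold for any logits.
implication_holds = True
for scale in (0.1, 1.0, 10.0, 100.0):
    noisy = confident + scale * rng.normal(size=(N, m))
    below = cross_entropy(noisy, labels) < np.log(2) / N
    if below and training_error(noisy, labels) != 0.0:
        implication_holds = False
```

The converse does not hold: a network can classify every training sample correctly while its loss remains above the threshold, so the bound is a sufficient condition only.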

Figures (4)

  • Figure 1: An example loss landscape with bad local valleys (left) and without bad local valleys (right).
  • Figure 2: Left: An example neural network represented as a directed acyclic graph. Right: The same network with skip connections added from a subset of hidden neurons to the output layer. All neurons with the same color can have shared or non-shared weights.
  • Figure 3: Loss surface of a two-hidden-layer network on a small MNIST dataset.
  • Figure 4: Training progress of a $150$-layer neural network with and without skip-connections.
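
The architecture described in the Figure 2 caption can be illustrated with a small forward pass. This is a minimal sketch rather than the authors' implementation. It assumes a plain fully connected chain, softplus as the analytic activation, and an output layer that reads only from the skip-connected neurons, so the logits are $\Psi(U)V$. The function name `skip_output_forward`, the layer sizes, and the choice of which layers feed the output are all hypothetical.

```python
import numpy as np

def softplus(x):
    # Numerically stable softplus log(1 + exp(x)), a real-analytic activation.
    return np.logaddexp(0.0, x)

def skip_output_forward(X, hidden_weights, V, skip_layers):
    """Return (logits, Psi), where Psi stacks activations of the chosen layers.

    hidden_weights plays the role of U; V maps the skip features to the outputs.
    """
    h = X
    skipped = []
    for idx, W in enumerate(hidden_weights):
        h = softplus(h @ W)
        if idx in skip_layers:
            # This layer's neurons are wired directly to the output layer.
            skipped.append(h)
    Psi = np.concatenate(skipped, axis=1)
    return Psi @ V, Psi

rng = np.random.default_rng(0)
N, d, width, m = 8, 5, 6, 3        # samples, input dim, layer width, classes
X = rng.normal(size=(N, d))
hidden_weights = [
    rng.normal(size=(d, width)),
    rng.normal(size=(width, width)),
    rng.normal(size=(width, width)),
]
skip_layers = {0, 2}               # hypothetical subset of layers with skips
V = rng.normal(size=(len(skip_layers) * width, m))

logits, Psi = skip_output_forward(X, hidden_weights, V, skip_layers)
```

In this notation, the random-feature baseline mentioned in the TL;DR would correspond to keeping `hidden_weights` fixed at their initial values and training only `V`.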

Theorems & Definitions (8)

  • Lemma 2.1
  • Lemma 3.2
  • Definition 3.3
  • Theorem 3.4
  • Lemma A.1
  • Proposition A.2
  • Definition C.2
  • Theorem C.3