Dynamical structure of vanishing gradient and overfitting in multi-layer perceptrons

Alex Alì Maleknia, Yuzuru Sato

Abstract

Vanishing gradients and overfitting are two of the most extensively studied problems in the machine learning literature. However, they are frequently considered in an asymptotic setting that obscures the underlying dynamical mechanisms responsible for their emergence. In this paper, we aim to provide a clear dynamical description of learning in multi-layer perceptrons (MLPs). To this end, we introduce a minimal model, inspired by the studies of Fukumizu and Amari, to investigate vanishing gradients and overfitting in MLPs trained via gradient descent. Within this model, we show that the learning dynamics may pass through plateau regions and near-optimal regions during training, both of which consist of saddle structures, before ultimately converging to the overfitting region. Under suitable conditions on the training dataset, we prove that, with high probability, the overfitting region collapses to a single attractor modulo symmetry, which corresponds to overfitting. Moreover, we show that any MLP trained on a finite noisy dataset cannot converge to the theoretical optimum and instead necessarily converges to an overfitting solution.
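As an illustration of the setup described above, the sketch below is not the authors' exact minimal model but a hypothetical stand-in: a two-hidden-unit tanh student trained by plain gradient descent on data from a one-hidden-unit teacher plus Gaussian observational noise of scale $\tau$. The function names, network sizes, learning rate, and dataset size are all illustrative choices; depending on the initialization, the recorded loss curve can display a plateau phase before the eventual descent toward an overfitting solution.

```python
# Illustrative sketch (not the paper's exact minimal model): a tanh MLP with two
# hidden units is trained by full-batch gradient descent on data generated by a
# one-hidden-unit teacher plus Gaussian observational noise of scale tau.
import numpy as np

rng = np.random.default_rng(0)

def make_data(n=50, tau=0.2):
    x = rng.uniform(-2.0, 2.0, size=n)
    y = np.tanh(1.5 * x) + tau * rng.normal(size=n)   # teacher output + noise
    return x, y

def loss_and_grad(params, x, y):
    w1, w2, v1, v2 = params                            # input and output weights
    h1, h2 = np.tanh(w1 * x), np.tanh(w2 * x)
    r = v1 * h1 + v2 * h2 - y                          # residuals
    L = 0.5 * np.mean(r ** 2)
    dh1, dh2 = 1 - h1 ** 2, 1 - h2 ** 2
    g = np.array([
        np.mean(r * v1 * dh1 * x),                     # dL/dw1
        np.mean(r * v2 * dh2 * x),                     # dL/dw2
        np.mean(r * h1),                               # dL/dv1
        np.mean(r * h2),                               # dL/dv2
    ])
    return L, g

def train(tau=0.2, iters=200_000, lr=0.05):
    x, y = make_data(tau=tau)
    params = rng.normal(scale=0.1, size=4)             # small random initialization
    history = []
    for t in range(iters):
        L, g = loss_and_grad(params, x, y)
        params -= lr * g                               # plain gradient descent step
        if t % 1000 == 0:
            history.append(L)                          # log loss to inspect plateaus
    return params, history

if __name__ == "__main__":
    _, noiseless = train(tau=0.0)
    _, noisy = train(tau=0.2)
    print("final loss without noise:", noiseless[-1])
    print("final loss with noise   :", noisy[-1])
```

Running the two calls at the bottom mimics, at sketch level, the noiseless versus noisy comparison shown in Figure 3 below, with the logged history playing the role of the learning curve.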

Paper Structure

This paper contains 9 sections, 5 theorems, 21 equations, and 3 figures.

Key Result

Proposition 3.1

For every $m \geq m^*$, $\mathcal{M}_m$ does not contain any critical points of $L$ for almost every realization of the data noise vector $\xi=(\xi_1,\dots,\xi_n)$. Moreover, $L$ is constant over $\mathcal{M}_m$ and follows the distribution $\frac{\tau^2}{2n}\chi^2(n)$. $\blacktriangleleft$
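The scaled chi-squared law can be read off under an assumed reading of the statement (the paper's definitions are not reproduced here): $L$ is taken to be the halved mean squared training error, the targets are $y_i=f^*(x_i)+\xi_i$ with i.i.d. $\xi_i\sim\mathcal{N}(0,\tau^2)$, and every parameter configuration $\theta$ on $\mathcal{M}_m$ reproduces the noiseless teacher outputs $f^*(x_i)$ exactly. Under these assumptions,

$L = \frac{1}{2n}\sum_{i=1}^{n}\bigl(f_\theta(x_i)-y_i\bigr)^2 = \frac{1}{2n}\sum_{i=1}^{n}\xi_i^2 = \frac{\tau^2}{2n}\sum_{i=1}^{n}\left(\frac{\xi_i}{\tau}\right)^2 \sim \frac{\tau^2}{2n}\chi^2(n),$

since the normalized residuals $\xi_i/\tau$ are i.i.d. standard normal, so the sum of their squares is $\chi^2(n)$-distributed; in particular, $L$ does not depend on where on $\mathcal{M}_m$ it is evaluated.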

Figures (3)

  • Figure 1: A multi-layer perceptron with two hidden layers of arbitrary size.
  • Figure 2: A schematic representation of the saddle-saddle-attractor scenario in MLP gradient descent learning. Empirically, the number of positive eigenvalues is smaller near the optimal region than in the plateau region. Overfitting corresponds to a stable attractor.
  • Figure 3: Graphs obtained after training the minimal model for 2 million iterations. The first column shows the learning curve; the second shows the parameters' orbits during training, where red stars mark the overfitting points and grey lines mark the singular region (dark grey) and the optimal region (light grey). Plots in (a) show training without observational noise (i.e., $\tau=0$), while in (b) a small amount of noise with $\tau=0.2$ is added.

Theorems & Definitions (11)

  • Definition 1
  • Definition 2
  • Proposition 3.1
  • Proof
  • Proposition 3.2
  • Proof
  • Corollary 3.1
  • Proof
  • Theorem 3.1
  • Proof
  • ...and 1 more