Noise-induced degeneration in online learning

Yuzuru Sato, Daiji Tsutsui, Akio Fujiwara

TL;DR

This work analyzes plateau phenomena in online SGD for a minimal three-layer perceptron (Fukumizu-Amari model) through the lens of random dynamical systems. It shows that, under finite data, SGD trajectories are globally attracted to degenerate subspaces and can exhibit noise-induced degeneration that further confines dynamics to multiply degenerated manifolds, with local attraction governed by a two-dimensional map. A key finding is the existence of an optimal noise level that minimizes the escape time from the degenerated subspace, contrasting with traditional Kramers-type escape pictures. The results suggest that degeneration and noise interactions are fundamental to online learning dynamics and may help explain generalization and behavior in larger, deeper networks.

Abstract

In order to elucidate the plateau phenomena caused by vanishing gradients, we herein analyse the stability of stochastic gradient descent near degenerated subspaces in a multi-layer perceptron. In stochastic gradient descent for the Fukumizu-Amari model, which is the minimal multi-layer perceptron showing non-trivial plateau phenomena, we show that (1) attracting regions exist in multiply degenerated subspaces, (2) a strong plateau phenomenon emerges as a noise-induced synchronisation, which is not observed in deterministic gradient descent, and (3) an optimal fluctuation exists that minimises the escape time from the degenerated subspace. The noise-induced degeneration observed herein is expected to be found in a broad class of machine learning via neural networks.

Paper Structure

This paper contains 16 sections, 2 theorems, 51 equations, 5 figures.

Key Result

Theorem 1

Assuming that for a sufficiently small learning rate $\eta>0$, the dynamics of $s(t)$ is contracting and approaching 0.

Figures (5)

  • Figure 1: A schematic view of the plateau phenomena (left) and stagnant dynamics near the degenerated subspace (right). The dynamics of learning slows down near the attracting region in the degenerated subspace, but eventually escapes to the optimum.
  • Figure 2: The three-layer perceptron: The nodes are activation functions given by $\tanh(\cdot)$, and each edge indicates a linear superposition with parameters $(w_1, w_2, v_1, v_2)$. The output $y$ is a function of the input $x$ and the parameters $(w_1, w_2, v_1, v_2)$.
  • Figure 3: (Left) The eigenvalues $\mu_+(x;1/2,1/2)$ (red) and $\mu_-(x;1/2,1/2)$ (blue) of $J(x;1/2,1/2)$, as well as the distribution $\rho(x)$, are depicted as functions of $x$. The parameters are set as $T(x)=2\tanh(x)-\tanh(4x)$, $\eta=0.1$, and $\sigma^2=0.1, 1$. When the fluctuation $\sigma^2$ is small, $x$ is frequently sampled near zero, and the point $(w,v)=(1/2,1/2)$ is a saddle point; otherwise, it is an attracting point. (Right) The probability $\pi(x;w,v)={\rm Prob}[\mu_+(x;w,v)\leq 0]$ for $\sigma^2=1$ plotted in $[0,2]^2$ on $M_{wv}$. The red curve $C$ indicates the valley formed by steep gradients of the averaged potential (see Appendix A).
  • Figure 4: Finite-time pullback attractors (see Appendix C) with the pullback times $\tau=1\,000$, $\tau=10\,000$, $\tau=30\,000$, and $\tau=100\,000$ in the full space $\bm{\Theta}$. Parameters are set as $T(x)=2\tanh(x)-\tanh(4x)$, $\eta=0.1$, and $\sigma^2=0.1$ (left) and $\sigma^2=1.0$ (right). The red and blue dots represent paths of $(w_1,w_2)$ and $(v_1,v_2)$, respectively, starting from different initial conditions. Both dynamics are plotted together in each panel. The grey points correspond to the optimal attractors $\bm{\theta}^*$. The degenerated subspaces $w_1=w_2$ and $v_1=v_2$ are depicted as a single line. A typical noise realisation $\{x\}$ is fixed and the dynamics is evolved from $10^5$ different initial conditions $\bm{\theta}(0)\in[-1,1]^4$. When $\sigma^2=1.0$, the trapping dynamics near $M_{wv}$ is observed in the attracting region indicated by a dashed circle.
  • Figure 5: The averaged escape time $\tau^*$ from $[-2,2]\times[-2,2]$ as a function of $\sigma^2$ in a log-log plot. The other parameters are the same as in the numerical computation in Fig. 4. Stronger noise induces a longer escape time because of noise-induced degeneration. The optimal fluctuation size is $\sigma^2\simeq 0.07$.
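
The setting described in the captions above can be sketched numerically. The following Python snippet is a minimal illustration and not the authors' code: it assumes the student network is $y = v_1\tanh(w_1 x) + v_2\tanh(w_2 x)$ (read off from the Figure 2 caption), that inputs are drawn from a zero-mean Gaussian with variance $\sigma^2$, and that the per-sample loss is the squared error $\frac{1}{2}(y - T(x))^2$. The function names `teacher`, `student`, `sgd_step`, and `run_online_sgd` are hypothetical; the teacher $T(x)$ and the learning rate $\eta=0.1$ are taken from the captions.

```python
import numpy as np


def teacher(x):
    # Teacher function T(x) quoted in the captions of Figures 3 and 4.
    return 2.0 * np.tanh(x) - np.tanh(4.0 * x)


def student(x, theta):
    # Assumed student form: y = v1*tanh(w1*x) + v2*tanh(w2*x).
    w1, w2, v1, v2 = theta
    return v1 * np.tanh(w1 * x) + v2 * np.tanh(w2 * x)


def sgd_step(theta, x, eta):
    # One online SGD step on the assumed loss 0.5*(y - T(x))**2.
    w1, w2, v1, v2 = theta
    h1, h2 = np.tanh(w1 * x), np.tanh(w2 * x)
    err = v1 * h1 + v2 * h2 - teacher(x)
    grad = np.array([
        err * v1 * (1.0 - h1 ** 2) * x,  # d loss / d w1
        err * v2 * (1.0 - h2 ** 2) * x,  # d loss / d w2
        err * h1,                        # d loss / d v1
        err * h2,                        # d loss / d v2
    ])
    return theta - eta * grad


def run_online_sgd(theta0, sigma2, eta=0.1, steps=10_000, seed=0):
    # Inputs are assumed i.i.d. N(0, sigma2); one sample is used per step.
    rng = np.random.default_rng(seed)
    xs = rng.normal(0.0, np.sqrt(sigma2), size=steps)
    theta = np.array(theta0, dtype=float)
    gaps = np.empty((steps, 2))
    for t, x in enumerate(xs):
        theta = sgd_step(theta, x, eta)
        # Distance from the subspaces w1 = w2 and v1 = v2.
        gaps[t] = (theta[0] - theta[1], theta[2] - theta[3])
    return theta, gaps
```

Calling, for example, `run_online_sgd([0.3, -0.2, 0.5, 0.1], sigma2=1.0)` returns the final parameters together with the history of the gaps $w_1-w_2$ and $v_1-v_2$, which is one simple way to watch how close a trajectory stays to the degenerated subspace. Under these assumptions, a trajectory started exactly on $w_1=w_2$, $v_1=v_2$ remains on it, since the two hidden units then receive identical updates.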

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Theorem 2
  • proof