On Multi-Stage Loss Dynamics in Neural Networks: Mechanisms of Plateau and Descent Stages

Zheng-An Chen, Tao Luo, GuiHong Wang

TL;DR

This work provides a more detailed proof for the initial plateau, followed by a comprehensive analysis of the initial descent stage dynamics, and examines the factors facilitating the network's ability to overcome the prolonged secondary plateau.

Abstract

The multi-stage phenomenon in the training loss curves of neural networks has been widely observed, reflecting the non-linearity and complexity inherent in the training process. In this work, we investigate the training dynamics of neural networks (NNs), with particular emphasis on the small initialization regime, identifying three distinct stages observed in the loss curve during training: the initial plateau stage, the initial descent stage, and the secondary plateau stage. Through rigorous analysis, we reveal the underlying challenges contributing to slow training during the plateau stages. While the proof and estimate for the emergence of the initial plateau were established in our previous work, the behaviors of the initial descent and secondary plateau stages had not been explored before. Here, we provide a more detailed proof for the initial plateau, followed by a comprehensive analysis of the initial descent stage dynamics. Furthermore, we examine the factors facilitating the network's ability to overcome the prolonged secondary plateau, supported by both experimental evidence and heuristic reasoning. Finally, to clarify the link between global training trends and local parameter adjustments, we use the Wasserstein distance to track the fine-scale evolution of weight amplitude distribution.

Paper Structure

This paper contains 21 sections, 20 theorems, 108 equations, 3 figures, 2 tables.

Key Result

Theorem 4$^*$

(Informal statement of Corollaries cor::similar_fp and cor::descent_distribution: the amplitude distributions of the weights are similar.) For $\alpha>\frac{1}{2}$, with high probability over the choice of the initial parameter $\bm{\theta}(T_0)$, we have for any $t \in [T_{\rm{p}}, T_{\rm{d}}]$, where $T_{\rm{d}}$ and $T_{\rm{p}}$ are defined in Theorems thm...informal_1 and thm...informal_2, respectively.

Figures (3)

  • Figure 1: Training loss (panel (a)), norms of the weights (panel (b)), and the relative Wasserstein distance between the weight amplitude distributions (panel (c)), defined as $W_2^{\rm{rel}}(\rho_{|a|},\rho_{\lVert\bm{w}\rVert}) = W_2 (\rho_{|a|},\rho_{\lVert\bm{w}\rVert})/\lVert\rho_{|a|}\rVert_2$, during training. The red region is the initial plateau stage, the blue region is the initial descent stage, and the green region is the secondary plateau stage.
  • Figure 2: Descent time $T_{\rm{d}}$ with respect to $\alpha$ (panel (a)) and $m$ (panel (b))
  • Figure 3: Ratio of weight norms (panel (a)) and relative Wasserstein distance (panel (b)) for different target functions.
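
The relative Wasserstein distance in the Figure 1 caption can be illustrated with a minimal sketch. It assumes a two-layer network with $m$ neurons, output weights $a_k$ and input weight vectors $\bm{w}_k$, and treats $\rho_{|a|}$ and $\rho_{\lVert\bm{w}\rVert}$ as empirical distributions over the $m$ neurons. The function name `relative_w2` is hypothetical, and reading the normalizer $\lVert\rho_{|a|}\rVert_2$ as the root mean square of the $|a_k|$ is an assumption; the paper may define it differently.

```python
import numpy as np

def relative_w2(a, W):
    """Relative 2-Wasserstein distance between the empirical
    distributions of |a_k| and ||w_k|| over the m neurons.

    a : array of shape (m,), output weights (assumed layout)
    W : array of shape (m, d), one input weight vector per row (assumed layout)
    """
    amp_a = np.sort(np.abs(a))                   # sorted amplitudes |a_k|
    amp_w = np.sort(np.linalg.norm(W, axis=1))   # sorted amplitudes ||w_k||
    # For two 1-D empirical measures with the same number of equally
    # weighted atoms, the optimal coupling matches sorted samples.
    w2 = np.sqrt(np.mean((amp_a - amp_w) ** 2))
    # Assumption: normalize by the root mean square of |a_k|.
    scale = np.sqrt(np.mean(amp_a ** 2))
    return w2 / scale

# Example with small random weights (scale 1e-3 chosen arbitrarily).
rng = np.random.default_rng(0)
m, d = 100, 5
a0 = 1e-3 * rng.standard_normal(m)
W0 = 1e-3 * rng.standard_normal((m, d))
print(relative_w2(a0, W0))
```

A value near zero means the two amplitude distributions nearly coincide, which is the sense in which Theorem 4$^*$ calls them similar.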

Theorems & Definitions (48)

  • Theorem 4$^*$
  • Remark 1
  • Lemma 1: Estimate of higher order terms
  • proof
  • Lemma 2: Growth lemma
  • Remark 2
  • proof
  • Theorem 1: Initial plateau stage
  • Remark 3
  • Remark 4
  • ...and 38 more