Multiple Descents in Deep Learning as a Sequence of Order-Chaos Transitions

Wenbo Wei, Nicholas Chong Jia Le, Choy Heng Lai, Ling Feng

TL;DR

This study investigates training dynamics in deep learning by applying an asymptotic stability framework to LSTMs trained on IMDb sentiment analysis. It reveals multiple descents—cycles of rising and sharply dropping test loss—that align with order-chaos transitions, with the global optimum at the first transition where the edge of chaos is widest. The approach links dynamical systems concepts to training behavior and suggests epoch-level strategies to exploit chaotic regimes for improved generalization, potentially extending beyond LSTMs.

Abstract

We observe a novel 'multiple-descent' phenomenon during the training of LSTMs, in which the test loss goes through several long cycles of rising and falling after the model is overtrained. By carrying out an asymptotic stability analysis of the models, we find that the cycles in test loss are closely associated with phase transitions between order and chaos, and that the locally optimal epochs consistently lie at the critical transition point between the two phases. More importantly, the globally optimal epoch occurs at the first transition from order to chaos, where the 'width' of the 'edge of chaos' is greatest, allowing the best exploration of better weight configurations for learning.

Paper Structure

This paper contains 9 sections, 4 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: Illustration of our methodology: the LSTM cell is iterated to obtain the asymptotic value of the output neuron $\boldsymbol{h_T}$ for large $T$ (here $T=1599$). The first 500 iterations use words from the movie reviews as input, after which only $\boldsymbol{0}$ vectors are fed to the LSTM cell to extract the order/chaos properties of the model.
  • Figure 2: Multiple descents through a sequence of order-chaos transitions during the training of the LSTM. (a) The average asymptotic log distance $\tilde{D}$ (green) under perturbation is used to indicate the order/chaos state. The optimal epoch of lstm-1 is 114, with an accuracy of 88.34%. Multiple descents are seen in the overfitting regime at epochs $>450$. An asymptotic distance of $-15$ means that two slightly different initial input values converge to the same value after sufficiently many iterations of the LSTM cell, indicating the ordered phase. A large asymptotic distance means the model is in the chaotic phase. (b) The 'bifurcation map' (blue) is shown together with the test loss (brown). The 'bifurcation map' is drawn by plotting the reduced sum $\boldsymbol h_{1599} \cdot \boldsymbol{1}$ for each of the 500 review samples at every epoch. For ease of visualization, within every epoch the average of all 500 reduced sums has been subtracted from each reduced sum. As in (a), samples converging to the same value indicate the ordered phase, and samples spreading out indicate the chaotic phase.
  • Figure 3: The $\tanh$ bifurcation map in equation (3) showing the asymptotic distances (green) and the order/chaos (a.k.a. bifurcation) diagram (blue). 500 random initial values of $k_0$ are used.
  • Figure 4: Relationship between test loss of wrong predictions and chaos (as measured by asymptotic distance). (a) illustrates the overfitting regime before clear multiple descents happen, i.e., between epoch 115 and 495. (b) illustrates the region where clear multiple descents occur, i.e., after epoch 495.
  • Figure 5: Two other experiments with the same settings, both showing multiple descents, with the best epoch occurring at the first order-to-chaos transition at (a) epoch 190 and (b) epoch 120.
  • ...and 1 more figure
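
The procedure described in the captions of Figures 1 and 2 (iterate the LSTM cell for 1600 steps, drive it with real inputs for the first 500 and with zero vectors afterwards, then compare two slightly perturbed trajectories) can be sketched roughly as below. This is only an illustration under several assumptions that are not taken from the paper: the weights are random rather than trained, random vectors stand in for word embeddings, the perturbation is applied to the initial hidden state, the distance is the base-10 log of the Euclidean distance, and the floor of $10^{-15}$ is inferred from the $-15$ value mentioned in the Figure 2 caption. All function and parameter names are hypothetical.

```python
import numpy as np


def sigmoid(x):
    # Numerically stable logistic function (identity: sigma(x) = (1 + tanh(x/2)) / 2).
    return 0.5 * (1.0 + np.tanh(0.5 * x))


def lstm_step(x, h, c, W, U, b):
    """One step of a standard LSTM cell; gates are stacked as [i, f, g, o]."""
    n = h.shape[0]
    z = W @ x + U @ h + b
    i = sigmoid(z[:n])
    f = sigmoid(z[n:2 * n])
    g = np.tanh(z[2 * n:3 * n])
    o = sigmoid(z[3 * n:])
    c_new = f * c + i * g
    h_new = o * np.tanh(c_new)
    return h_new, c_new


def asymptotic_h(inputs, h0, W, U, b, total_steps=1600, driven_steps=500):
    """Iterate the cell: real inputs for `driven_steps`, zero vectors afterwards."""
    h, c = h0.copy(), np.zeros_like(h0)
    zero_input = np.zeros(W.shape[1])
    for t in range(total_steps):
        x = inputs[t] if t < driven_steps else zero_input
        h, c = lstm_step(x, h, c, W, U, b)
    return h


def asymptotic_log_distance(inputs, W, U, b, eps=1e-8, floor=1e-15, seed=0):
    """log10 distance between final states of two trajectories whose
    initial hidden states differ by a perturbation of norm `eps` (assumption)."""
    rng = np.random.default_rng(seed)
    n = U.shape[1]
    h0 = np.zeros(n)
    delta = rng.standard_normal(n)
    delta *= eps / np.linalg.norm(delta)
    h_a = asymptotic_h(inputs, h0, W, U, b)
    h_b = asymptotic_h(inputs, h0 + delta, W, U, b)
    # The floor keeps the log finite when both trajectories coincide exactly.
    return float(np.log10(max(np.linalg.norm(h_a - h_b), floor)))


def random_lstm_distance(scale, n_hidden=16, n_input=8, seed=0):
    """Hypothetical demo: random weights of a given scale stand in for a trained model."""
    rng = np.random.default_rng(seed)
    W = scale * rng.standard_normal((4 * n_hidden, n_input))
    U = scale * rng.standard_normal((4 * n_hidden, n_hidden))
    b = np.zeros(4 * n_hidden)
    inputs = rng.standard_normal((500, n_input))  # stand-in for word embeddings
    return asymptotic_log_distance(inputs, W, U, b)
```

With small weights (for example `random_lstm_distance(0.05)`) the cell is contractive, so the two trajectories collapse onto each other and the log distance sits at the floor, which corresponds to the ordered phase in the captions. Repeating the measurement over many samples and averaging would give a quantity analogous to $\tilde{D}$, though the paper's exact averaging and perturbation scheme is not specified in this section.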