New Evidence of the Two-Phase Learning Dynamics of Neural Networks

Zhanpeng Zhou, Yongyi Yang, Mahito Sugiyama, Junchi Yan

TL;DR

The paper studies how deep neural networks learn by introducing an interval-wise analysis framework that compares network states across training time windows. It reports two phenomena that expose a two-phase structure in training: the Chaos Effect, where a tiny parameter perturbation injected before the transition point can cause large divergence between the resulting models, and the Cone Effect, where after the transition the functional trajectory is confined to a narrow cone in function space even though the eNTK keeps evolving. The experiments use VGG-16 and ResNet-20 on CIFAR-10, and the dynamics are characterized with four metrics: parameter dissimilarity, kernel distance, loss barrier, and disagreement rate. Together, the findings give a structural, dynamical view of learning in which late training remains nonlinear yet constrained, with practical implications for training strategies and open directions for theoretical analysis.

Abstract

Understanding how deep neural networks learn remains a fundamental challenge in modern machine learning. A growing body of evidence suggests that training dynamics undergo a distinct phase transition, yet our understanding of this transition is still incomplete. In this paper, we introduce an interval-wise perspective that compares network states across a time window, revealing two new phenomena that illuminate the two-phase nature of deep learning. (i) The Chaos Effect. By injecting an imperceptibly small parameter perturbation at various stages, we show that the response of the network to the perturbation exhibits a transition from chaotic to stable, suggesting there is an early critical period where the network is highly sensitive to initial conditions. (ii) The Cone Effect. Tracking the evolution of the empirical Neural Tangent Kernel (eNTK), we find that after this transition point the model's functional trajectory is confined to a narrow cone-shaped subset: while the kernel continues to change, it gets trapped into a tight angular region. Together, these effects provide a structural, dynamical view of how deep networks transition from sensitive exploration to stable refinement during training.
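
To make the perturbation comparison concrete, the sketch below shows one way the three model-comparison metrics named in this summary (parameter dissimilarity, loss barrier, disagreement rate) could be computed for a pair of models trained from the original and the perturbed state. The exact definitions used in the paper are not given here, so the formulas are assumptions based on common usage: one minus cosine similarity for parameters, the largest excess loss along the linear interpolation path for the barrier, and the fraction of differing predictions for disagreement. The function names and the `loss_fn` callback are hypothetical.

```python
import math


def parameter_dissimilarity(theta_a, theta_b):
    """Assumed definition: 1 - cosine similarity of two flattened parameter vectors."""
    dot = sum(a * b for a, b in zip(theta_a, theta_b))
    norm_a = math.sqrt(sum(a * a for a in theta_a))
    norm_b = math.sqrt(sum(b * b for b in theta_b))
    return 1.0 - dot / (norm_a * norm_b)


def disagreement_rate(preds_a, preds_b):
    """Fraction of inputs on which the two models predict different labels."""
    if len(preds_a) != len(preds_b):
        raise ValueError("prediction lists must have the same length")
    return sum(p != q for p, q in zip(preds_a, preds_b)) / len(preds_a)


def loss_barrier(loss_fn, theta_a, theta_b, num_points=11):
    """Assumed definition: largest gap between the loss on the straight line
    from theta_a to theta_b and the linear interpolation of the endpoint losses."""
    loss_a, loss_b = loss_fn(theta_a), loss_fn(theta_b)
    barrier = 0.0
    for k in range(num_points):
        alpha = k / (num_points - 1)
        # Interpolated parameters at position alpha on the segment.
        theta = [(1 - alpha) * a + alpha * b for a, b in zip(theta_a, theta_b)]
        gap = loss_fn(theta) - ((1 - alpha) * loss_a + alpha * loss_b)
        barrier = max(barrier, gap)
    return barrier


if __name__ == "__main__":
    # Toy double-well loss with minima at -1 and +1 and a bump in between.
    double_well = lambda theta: (theta[0] ** 2 - 1.0) ** 2
    print(loss_barrier(double_well, [-1.0], [1.0]))
    print(disagreement_rate([0, 1, 2, 2], [0, 1, 1, 2]))
```

In a chaos-effect style experiment, these functions would be evaluated at time $t_1$ on the two models that differ only by the perturbation injected at $t_0$. Values near zero would indicate that the perturbation was absorbed, while large values would indicate divergence.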

Paper Structure

This paper contains 18 sections, 6 equations, 6 figures.

Figures (6)

  • Figure 1: An illustration of the two-phase training dynamics. The optimization trajectory first passes through a chaotic training phase, termed the chaos effect, and then enters a more stable, non-chaotic regime, termed the cone effect.
  • Figure 2: An illustration of the injected perturbation. $\boldsymbol{\theta}_0$ denotes the initialization. The same stochastic gradient noise is applied to both $\boldsymbol{\theta}$ and $\boldsymbol{\theta}'$ during training.
  • Figure 3: The sensitivity of learning dynamics to tiny perturbations. We apply the perturbation $\boldsymbol\epsilon$ at time $t_0$ and compare the resulting models at $t_1$ using several metrics. We set $\left\lVert\boldsymbol{\epsilon}\right\rVert_0 = 10^{-7}$. Results are reported for VGG-16 and ResNet-20 on CIFAR-10. (a) The parameter dissimilarity $(\boldsymbol{C})_{t_0, t_1}$. (b) The loss barrier $(\boldsymbol{B})_{t_0, t_1}$. (c) The disagreement rate $(\boldsymbol{D})_{t_0, t_1}$. Note that $t_0$ and $t_1$ are given in iterations, not epochs.
  • Figure 4: The kernel distance between every pair of points on the optimization trajectory $\{\boldsymbol{\theta}_t\}_{t=1}^T$. Results are reported for both VGG-16 and ResNet-20 on CIFAR-10. Note that $i$ and $j$ are given in iterations, not epochs.
  • Figure 5: Constrained learning dynamics in the second phase. (a) The kernel distance between the current iterate $\boldsymbol{\theta}_t$ and a reference point $\boldsymbol{\theta}_{\tau}$ vs. training iteration $t$, where $\tau$ is varied. (b) The kernel distance between two adjacent iterates $\boldsymbol{\theta}_t$ and $\boldsymbol{\theta}_{t+dt}$ vs. training iteration $t$, where $dt$ is varied. (c) A visualization of the changes in the eNTK matrices $\boldsymbol{H}(\boldsymbol{\theta}_t)$. The black dot represents the position of the eNTK matrix at initialization, i.e., $\boldsymbol{H}(\boldsymbol{\theta}_0)$. The other dots represent the relative positions of $\boldsymbol{H}(\boldsymbol{\theta}_t)$ at $t>0$, with darker colors indicating later iterations.
  • ...and 1 more figure
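
The kernel-distance plots described in Figures 4 and 5 compare eNTK matrices taken at different iterations. The sketch below shows one plausible way to compute such a distance, under two labeled assumptions. First, the eNTK is formed as $\boldsymbol{H} = \boldsymbol{J}\boldsymbol{J}^\top$ from a per-example Jacobian of a scalar-output model (a multi-class network would need an extra class dimension). Second, the kernel distance is one minus the cosine similarity under the Frobenius inner product, a common choice in the literature that may differ from the paper's exact definition. The function names are hypothetical.

```python
import math


def entk_from_jacobian(jacobian):
    """Assumed construction: H = J J^T, where jacobian[i] is the gradient of the
    (scalar) model output on example i with respect to the flattened parameters."""
    return [
        [sum(a * b for a, b in zip(row_i, row_j)) for row_j in jacobian]
        for row_i in jacobian
    ]


def kernel_distance(kernel_a, kernel_b):
    """Assumed definition: 1 - <A, B>_F / (||A||_F * ||B||_F).

    The value depends only on the angle between the two matrices, not on
    their scale, which is what makes it suitable for detecting a cone."""
    inner = sum(
        a * b
        for row_a, row_b in zip(kernel_a, kernel_b)
        for a, b in zip(row_a, row_b)
    )
    norm_a = math.sqrt(sum(a * a for row in kernel_a for a in row))
    norm_b = math.sqrt(sum(b * b for row in kernel_b for b in row))
    return 1.0 - inner / (norm_a * norm_b)


if __name__ == "__main__":
    # Two toy Jacobians for a model with 2 examples and 2 parameters.
    h_early = entk_from_jacobian([[1.0, 0.0], [0.0, 1.0]])
    h_late = entk_from_jacobian([[1.0, 1.0], [1.0, 1.0]])
    print(kernel_distance(h_early, h_late))
    # Rescaling a kernel leaves the distance at (numerically) zero.
    rescaled = [[2.0 * v for v in row] for row in h_late]
    print(kernel_distance(h_late, rescaled))
```

Because this distance is purely angular, a kernel can keep changing in magnitude or drift within a small angular region while every pairwise distance stays small. This matches the description of a trajectory that continues to move yet remains inside a narrow cone.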