Towards Understanding Gradient Flow Dynamics of Homogeneous Neural Networks Beyond the Origin

Akshay Kumar, Jarvis Haupt

TL;DR

The paper analyzes gradient-flow dynamics of $L$-positively homogeneous neural networks in the small-initialization regime, focusing on the phase after escaping the origin. It shows that post-escape trajectories closely follow a limiting path ${\mathbf p}(t)$ guided by a second-order positive KKT point of the Neural Correlation Function, enabling precise characterization of the first encountered saddle. For feed-forward homogeneous networks, sparsity patterns observed before escape are shown to persist after escape under zero-preserving-subset conditions, linking early feature-learning structure to later optimization dynamics. Although the analysis excludes ReLU due to the Lipschitz-gradient assumption, the work provides a rigorous, tractable description of a meaningful segment of gradient flow beyond the origin and offers empirical corroboration via numerical experiments.

Abstract

Recent works exploring the training dynamics of homogeneous neural network weights under gradient flow with small initialization have established that in the early stages of training, the weights remain small and near the origin, but converge in direction. Building on this, the current paper studies the gradient flow dynamics of homogeneous neural networks with locally Lipschitz gradients, after they escape the origin. Insights gained from this analysis are used to characterize the first saddle point encountered by gradient flow after escaping the origin. Also, it is shown that for homogeneous feed-forward neural networks, under certain conditions, the sparsity structure emerging among the weights before the escape is preserved after escaping the origin and until reaching the next saddle point.

Paper Structure

This paper contains 19 sections, 29 theorems, 415 equations, 6 figures.

Key Result

Lemma 1

The origin is a critical point of the training-loss optimization problem (labeled loss_fn in the paper).
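
A short sketch of why a statement like Lemma 1 can hold, under assumptions that are not spelled out in this summary: the network output $\mathcal{H}({\mathbf{x}};{\mathbf{w}})$ is $L$-positively homogeneous in the weights with $L \geq 2$, its gradient is continuous, and the loss is the square loss over training pairs $({\mathbf{x}}_i, y_i)$. The symbols $\mathcal{H}$, $\mathcal{L}$, $c$, and $n$ are introduced here for illustration only.

$$
\begin{aligned}
\mathcal{H}({\mathbf{x}}; c{\mathbf{w}}) &= c^{L}\,\mathcal{H}({\mathbf{x}};{\mathbf{w}}) \quad \text{for all } c > 0
&&\Longrightarrow\quad \mathcal{H}({\mathbf{x}};{\mathbf{0}}) = 0, \\
\nabla_{{\mathbf{w}}}\mathcal{H}({\mathbf{x}}; c{\mathbf{w}}) &= c^{L-1}\,\nabla_{{\mathbf{w}}}\mathcal{H}({\mathbf{x}};{\mathbf{w}})
&&\Longrightarrow\quad \nabla_{{\mathbf{w}}}\mathcal{H}({\mathbf{x}};{\mathbf{0}}) = {\mathbf{0}} \quad (\text{let } c \to 0^{+},\ L \geq 2), \\
\mathcal{L}({\mathbf{w}}) &= \frac{1}{2}\sum_{i=1}^{n}\bigl(\mathcal{H}({\mathbf{x}}_i;{\mathbf{w}}) - y_i\bigr)^{2}
&&\Longrightarrow\quad \nabla\mathcal{L}({\mathbf{0}}) = \sum_{i=1}^{n}\bigl(0 - y_i\bigr)\,\nabla_{{\mathbf{w}}}\mathcal{H}({\mathbf{x}}_i;{\mathbf{0}}) = {\mathbf{0}}.
\end{aligned}
$$

The second line follows from differentiating the first with respect to ${\mathbf{w}}$ and dividing by $c$. Because the gradient vanishes at the origin, gradient flow started at a small initialization moves slowly at first, which is consistent with the early phase described in the abstract.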

Figures (6)

  • Figure 1: We train a two-layer neural network with output ${\mathbf{v}}^\top\sigma({\mathbf{W}}{\mathbf{x}})$, where $\sigma(x) = x^2$, and trainable weights ${\mathbf{v}}\in \mathbb{R}^{50},{\mathbf{W}} \in \mathbb{R}^{50 \times 20}$. The training set has $100$ points sampled uniformly from the unit sphere in ${\mathbb{R}}^{20}$. We minimize the square loss with respect to the output of a smaller two-layer neural network with two neurons and square activation. We train using gradient descent with small initial weights, as depicted in panel (a). Panel (b) shows the evolution of the loss with iterations. Panels (c) and (d) depict the absolute value of the weights at iterations $i_1$ and $i_2$ (marked in panel (b)), approximately just before escaping the origin and immediately after reaching the next saddle point, respectively (the gap between them is 5000 iterations). Panels (c) and (d) show that the sparsity structure emerging among the weights before escaping the origin is preserved until reaching the next saddle point.
  • Figure 2: The contour of the loss function in \ref{loss_ex} is shown in the background. The foreground shows the evolution of $\bm{\psi}(t,\delta{\mathbf{w}}_0)$, for $\delta \in\{0.1,0.05,0.001\}$ and $t\in [0,3]$ (in red), and ${\mathbf{p}}(t)$, for $t\in [-1,1]$ (in green). The saddle point at $(2,0)$ and the global minimum at $(2,1)$ are marked with a green and a red dot, respectively.
  • Figure 3: The weights corresponding to the dashed arrows, or equivalently, all the incoming and outgoing weights of the hidden neurons in gray, form a zero-preserving subset.
  • Figure 4: We train a three-layer neural network whose output is ${\mathbf{v}}^\top\sigma({\mathbf{W}}_2\sigma({\mathbf{W}}_1{\mathbf{x}}))$, where $\sigma(x) = x^2$ (square activation), and ${\mathbf{v}}\in \mathbb{R}^{20},{\mathbf{W}}_2,{\mathbf{W}}_1 \in \mathbb{R}^{20 \times 20}$ are the trainable weights. The sparsity structure is preserved upon escaping from the origin.
  • Figure 5: We train a two-layer neural network whose output is ${\mathbf{v}}^\top\sigma({\mathbf{W}}{\mathbf{x}})$, where $\sigma(x) = \max(x,0)$ (ReLU activation), and ${\mathbf{v}}\in \mathbb{R}^{50},{\mathbf{W}} \in \mathbb{R}^{50 \times 20}$ are the trainable weights. As in the previous example, the sparsity structure is preserved upon escaping from the origin.
  • ...and 1 more figure
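
The setup described in the Figure 1 caption can be approximated with a short NumPy sketch. The dimensions (50 hidden neurons, 20 inputs, 100 points on the unit sphere, a two-neuron teacher with square activation) come from the caption. Everything else is an assumption made for illustration: the function names, the Gaussian teacher weights, the initialization scale, the step size, the number of steps, and the averaging of the loss over samples. It is not the authors' code, and the hyperparameters would likely need tuning before the escape from the origin becomes visible.

```python
import numpy as np


def sample_sphere(n, d, rng):
    """Draw n points uniformly from the unit sphere in R^d."""
    x = rng.standard_normal((n, d))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def forward(v, W, X):
    """Two-layer network v^T sigma(W x) with sigma(z) = z^2, for each row of X."""
    return (X @ W.T) ** 2 @ v


def loss_and_grads(v, W, X, y):
    """Mean square loss (with a 1/2 factor) and its gradients w.r.t. v and W."""
    n = len(y)
    Z = X @ W.T                      # pre-activations, shape (n, k)
    r = Z ** 2 @ v - y               # residuals, shape (n,)
    loss = 0.5 * np.mean(r ** 2)
    grad_v = (Z ** 2).T @ r / n
    grad_W = 2.0 * ((r[:, None] * Z * v[None, :]).T @ X) / n
    return loss, grad_v, grad_W


def train(X, y, k=50, init_scale=1e-3, lr=0.05, steps=2000, seed=0):
    """Plain gradient descent from a small random initialization.

    init_scale, lr, and steps are hypothetical values, not taken from the paper.
    """
    rng = np.random.default_rng(seed)
    d = X.shape[1]
    v = init_scale * rng.standard_normal(k)
    W = init_scale * rng.standard_normal((k, d))
    losses = []
    for _ in range(steps):
        loss, grad_v, grad_W = loss_and_grads(v, W, X, y)
        losses.append(loss)
        v -= lr * grad_v
        W -= lr * grad_W
    return v, W, np.array(losses)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    X = sample_sphere(100, 20, rng)
    # Teacher: a smaller two-neuron network with square activation (weights assumed Gaussian).
    v_teacher = rng.standard_normal(2)
    W_teacher = rng.standard_normal((2, 20))
    y = forward(v_teacher, W_teacher, X)
    v, W, losses = train(X, y)
    print("initial loss:", losses[0], "final loss:", losses[-1])
    # Row norms of W give a rough view of which hidden neurons stay near zero.
    print("largest row norms of W:", np.sort(np.linalg.norm(W, axis=1))[-5:])
```

Scaling both `v` and `W` by a factor `c > 0` scales the output of `forward` by `c**3`, so this network is 3-positively homogeneous in its weights. To look for the sparsity pattern shown in panels (c) and (d), one could plot `np.abs(W)` and `np.abs(v)` at an iteration near the end of the initial loss plateau and again at the next plateau.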

Theorems & Definitions (34)

  • Lemma 1
  • Definition 2
  • Lemma 3
  • Theorem 4
  • Corollary 5
  • Corollary 6
  • Example 1
  • Lemma 7
  • Lemma 8
  • Lemma 9
  • ...and 24 more