Directional Convergence Near Small Initializations and Saddles in Two-Homogeneous Neural Networks

Akshay Kumar, Jarvis Haupt

TL;DR

This work analyzes gradient flow dynamics for two-homogeneous neural networks initialized near the origin and shows that, for square and logistic losses, the weights spend substantial time near zero and align in direction with non-negative KKT points of a neural correlation function (NCF). The authors introduce the NCF and prove directional convergence near initialization and near certain saddles using a rigorous framework based on differential inclusions and o-minimal definability for non-smooth losses, with corollaries for separable networks. They discuss higher-order homogeneity and identify open questions for extending results to general L-homogeneous networks. Overall, the results illuminate how small initializations steer early training toward specific directional patterns, offering a principled view of implicit regularization in non-smooth, overparameterized models.

Abstract

This paper examines gradient flow dynamics of two-homogeneous neural networks for small initializations, where all weights are initialized near the origin. For both square and logistic losses, it is shown that for sufficiently small initializations, the gradient flow dynamics spend sufficient time in the neighborhood of the origin to allow the weights of the neural network to approximately converge in direction to the Karush-Kuhn-Tucker (KKT) points of a neural correlation function that quantifies the correlation between the output of the neural network and corresponding labels in the training data set. For square loss, it has been observed that neural networks undergo saddle-to-saddle dynamics when initialized close to the origin. Motivated by this, this paper also shows a similar directional convergence among weights of small magnitude in the neighborhood of certain saddle points.
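
The role of the neural correlation function (NCF) near the origin can be made concrete with a short derivation for the square loss. This is a sketch under stated assumptions rather than the paper's argument: it treats the network output $\mathcal{H}(\mathbf{X};\mathbf{w})$ as differentiable with Jacobian $\mathbf{J}(\mathbf{w})$ (a symbol introduced here for illustration), whereas the paper handles non-smooth networks through differential inclusions. The $\tfrac12$ scaling of the loss is also an assumption.

```latex
% Assumed notation: J(w) is the Jacobian of H(X; w) with respect to w.
\begin{align*}
\mathcal{N}_{\mathbf{y},\mathcal{H}}(\mathbf{w})
  &= \mathbf{y}^\top \mathcal{H}(\mathbf{X};\mathbf{w})
   = \sum_{i=1}^n y_i\,\mathcal{H}(\mathbf{x}_i;\mathbf{w}), \\
\mathcal{L}(\mathbf{w})
  &= \tfrac{1}{2}\,\bigl\|\mathcal{H}(\mathbf{X};\mathbf{w}) - \mathbf{y}\bigr\|_2^2, \\
-\nabla\mathcal{L}(\mathbf{w})
  &= \mathbf{J}(\mathbf{w})^\top\bigl(\mathbf{y} - \mathcal{H}(\mathbf{X};\mathbf{w})\bigr)
   = \nabla\mathcal{N}_{\mathbf{y},\mathcal{H}}(\mathbf{w})
     - \mathbf{J}(\mathbf{w})^\top \mathcal{H}(\mathbf{X};\mathbf{w}).
\end{align*}
```

Two-homogeneity means $\mathcal{H}(\mathbf{X};c\mathbf{w}) = c^2\,\mathcal{H}(\mathbf{X};\mathbf{w})$ for $c \geq 0$, so $\mathbf{J}$ is one-homogeneous. The first term is therefore of order $\|\mathbf{w}\|$, while the correction $\mathbf{J}^\top\mathcal{H}$ is of order $\|\mathbf{w}\|^3$. For small $\|\mathbf{w}\|$, gradient flow on the loss approximately follows gradient flow on the NCF, which is why the directions of the weights are governed by the KKT points of the NCF.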

Paper Structure (33 sections, 27 theorems, 245 equations, 4 figures)

Key Result

Theorem 5.1

Let $\mathbf{w}_0$ be a unit norm vector and a non-branching initialization of the differential inclusion [...]. For any $\epsilon\in(0,\eta)$, where $\eta$ is a positive constant, there exist $C>1$ [...]

Footnote: Here, $\eta$ depends on the solution of main_flow_thm, which relies solely on $\mathbf{X}, \mathbf{y}, \mathcal{H}, \mathbf{w}_0$ and is independent of $\delta$. See low_bd_U and the proof for more details.

Figures (4)

  • Figure 1: A two-dimensional scenario where a single-layer squared ReLU neural network with $20$ hidden neurons is trained by gradient descent. The network architecture is defined as $\mathcal{H}(x_1,x_2;\{\mathbf{u}_i\}_{i=1}^{20}) = \sum_{i=1}^{20}\max(0,\mathbf{u}_{1i}x_1+\mathbf{u}_{2i}x_2)^2$, where $\mathbf{u}_i$ represents the weights of the $i$th neuron. For training, we use 50 unit norm inputs, and the corresponding labels are generated using the function $\mathcal{H}^*(x_1,x_2) = 5\max(0,x_1)^2+4\max(0,-x_1)^2$. We use square loss and optimize using gradient descent for 50000 iterations with step-size $5\cdot 10^{-5}$. At initialization, the weights of each hidden neuron are drawn from a Gaussian distribution with standard deviation $10^{-5}$. Panel $(a)$: the evolution of the training loss and the $\ell_2$-norm of all the weights with iterations. Panel $(b)$: the evolution of $\arctan(\mathbf{u}_{2i}(t)/\mathbf{u}_{1i}(t))$ (the angle $\mathbf{u}_{i}(t)$ makes with the positive $x$-axis) for all hidden neurons. We see that the norm of the weights remains small and the loss barely changes, though the weight vectors converge in direction to their final location (denoted with red dots).
  • Figure 2: The gradient field of $f(u_1,u_2) = u_1|u_2|$.
  • Figure 3: The lower part shows the content of \ref{near_init_dir_evol} with the horizontal and vertical axes interchanged. The top plot shows the constrained NCF $\mathcal{N}_{\mathbf{y},\mathcal{H}}(\theta)= \sum_{i=1}^n y_i\max(0,[\cos(\theta), \sin(\theta)]^\top\mathbf{x}_i)^2$. As predicted by \ref{sep_nn}, the neuron weights converge in direction to the KKT points of the NCF.
  • Figure 4: Panel $(a)$: the evolution of training loss and the $\ell_2$-distance of the weights from the saddle point with iterations. Panel $(b)$: The lower part shows the evolution of $\arctan(\mathbf{u}_{2i}(t)/\mathbf{u}_{1i}(t))$ for the last 10 hidden neurons. The top plot shows the constrained NCF $\mathcal{N}_{\bar{\mathbf{y}},\mathcal{H}}(\theta)= \sum_{i=1}^n \bar{y}_i\max(0,[\cos(\theta), \sin(\theta)]^\top\mathbf{x}_i)^2$, where $\bar{\mathbf{y}}$ is the residual error at the saddle point. We see that the weights remain near the saddle point and loss barely changes, though the weights of the last 10 neurons converge in direction to the KKT points of the constrained NCF.
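
The setup described in the Figure 1 and Figure 3 captions can be sketched in a few lines of plain Python. This is a minimal illustration under assumptions, not the authors' code: the inputs are sampled at uniformly random angles, the loss is taken as $\tfrac12\sum_j(\mathcal{H}(\mathbf{x}_j)-y_j)^2$, and every function name (`make_data`, `gd_step`, `constrained_ncf`, and so on) is hypothetical. The network, label function, step-size, and initialization scale follow the captions.

```python
import math
import random


def target(x):
    """Label function H*(x1, x2) = 5 max(0, x1)^2 + 4 max(0, -x1)^2 (Figure 1 caption)."""
    return 5.0 * max(0.0, x[0]) ** 2 + 4.0 * max(0.0, -x[0]) ** 2


def make_data(n=50, seed=0):
    """n unit-norm inputs; sampling angles uniformly is an assumption."""
    rng = random.Random(seed)
    angles = [rng.uniform(-math.pi, math.pi) for _ in range(n)]
    xs = [(math.cos(a), math.sin(a)) for a in angles]
    return xs, [target(x) for x in xs]


def init_weights(m=20, std=1e-5, seed=1):
    """Gaussian initialization with standard deviation std for each of m neurons."""
    rng = random.Random(seed)
    return [[rng.gauss(0.0, std), rng.gauss(0.0, std)] for _ in range(m)]


def predict(us, x):
    """Squared ReLU network: sum_i max(0, u_i . x)^2."""
    return sum(max(0.0, u[0] * x[0] + u[1] * x[1]) ** 2 for u in us)


def gd_step(us, xs, ys, lr):
    """One full-batch gradient descent step on 0.5 * sum_j (H(x_j) - y_j)^2 (assumed scaling)."""
    residuals = [predict(us, x) - y for x, y in zip(xs, ys)]
    new_us = []
    for u in us:
        g0 = g1 = 0.0
        for x, r in zip(xs, residuals):
            pre = u[0] * x[0] + u[1] * x[1]
            if pre > 0.0:  # the squared ReLU has zero gradient for pre <= 0
                g0 += r * 2.0 * pre * x[0]
                g1 += r * 2.0 * pre * x[1]
        new_us.append([u[0] - lr * g0, u[1] - lr * g1])
    return new_us


def train(us, xs, ys, lr=5e-5, steps=1000):
    """Run `steps` gradient descent iterations and return the final weights."""
    for _ in range(steps):
        us = gd_step(us, xs, ys, lr)
    return us


def total_norm(us):
    """l2-norm of all weights, the quantity tracked in panel (a) of Figure 1."""
    return math.sqrt(sum(u[0] ** 2 + u[1] ** 2 for u in us))


def angle(u):
    """Angle of a neuron's weight vector with the positive x-axis (atan2 keeps the quadrant)."""
    return math.atan2(u[1], u[0])


def constrained_ncf(theta, xs, ys):
    """Constrained NCF: sum_i y_i max(0, [cos(theta), sin(theta)] . x_i)^2 (Figure 3 caption)."""
    c, s = math.cos(theta), math.sin(theta)
    return sum(y * max(0.0, c * x[0] + s * x[1]) ** 2 for x, y in zip(xs, ys))
```

A typical use would be to call `train(init_weights(), *make_data(), steps=...)`, record `angle(u)` for each neuron along the way, and compare those angles against the local maxima of `constrained_ncf` evaluated on a grid of `theta` values. The pure-Python loops are slow, so the full 50000 iterations from the caption are better run with a vectorized library.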

Theorems & Definitions (42)

  • Definition 4.1
  • Theorem 5.1
  • Lemma 5.2
  • Lemma 5.3
  • Lemma 5.4
  • Corollary 5.4.1
  • Theorem 5.5
  • Lemma 5.6
  • Lemma A.1
  • Lemma A.2
  • ...and 32 more