Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations

Akshay Kumar, Jarvis Haupt

TL;DR

The paper addresses the non-linear training dynamics of deep $L$-homogeneous networks with $L>2$ under small initialization by linking early weight-direction convergence to non-negative KKT points of a constrained Neural Correlation Function (NCF). It introduces a time-rescaled gradient-flow analysis, showing that, for small $\delta$, the weight norms stay $O(\delta)$ and directions align with KKT points of the constrained NCF (or collapse to zero) within a time horizon scaled by $1/\delta^{L-2}$. Beyond this, it characterizes rank-one KKT points for feed-forward networks with Leaky ReLU and polynomial Leaky ReLU activations, providing necessary and sufficient conditions and confirming them numerically as common in practice. The results offer insight into the emergent low-rank structure observed in early training and lay groundwork for understanding training dynamics in deeper, non-NTK regimes, with ReLU posing a notable future challenge. The work thus bridges theoretical understanding of deep homogeneous networks in the small-initialization regime with empirical observations of rank-one weight structures, potentially informing generalization and initialization strategies.
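
The $1/\delta^{L-2}$ horizon can be motivated by a short heuristic calculation. The sketch below is not taken from the paper's proof; it assumes a square loss, training data $(\mathbf{X},\mathbf{y})$, a network output $\mathcal{H}(\mathbf{X};\mathbf{w})$ that is $L$-homogeneous in $\mathbf{w}$, and an NCF of the form $\mathcal{N}(\mathbf{w}) = \mathbf{y}^\top \mathcal{H}(\mathbf{X};\mathbf{w})$. None of this notation is defined in this summary, so treat it as illustrative.

```latex
% Heuristic sketch under assumed notation (square loss, NCF N(w) = y^T H(X; w)).
\begin{align*}
\dot{\mathbf{w}}
  &= -\nabla_{\mathbf{w}} \tfrac{1}{2}\,\|\mathcal{H}(\mathbf{X};\mathbf{w}) - \mathbf{y}\|_2^2
   = \nabla_{\mathbf{w}} \mathcal{N}(\mathbf{w})
     - \nabla_{\mathbf{w}} \tfrac{1}{2}\,\|\mathcal{H}(\mathbf{X};\mathbf{w})\|_2^2 . \\
\intertext{Write $\mathbf{w}(t) = \delta\,\mathbf{s}(t)$ with $\mathbf{s}(0) = \mathbf{w}_0$.
Since $\nabla\mathcal{N}$ is $(L-1)$-homogeneous and the second gradient is $(2L-1)$-homogeneous,}
\delta\,\dot{\mathbf{s}}
  &= \delta^{L-1}\,\nabla\mathcal{N}(\mathbf{s}) + O(\delta^{2L-1})
  \quad\Longrightarrow\quad
  \dot{\mathbf{s}} = \delta^{L-2}\,\nabla\mathcal{N}(\mathbf{s}) + O(\delta^{2L-2}) . \\
\intertext{Rescaling time as $\tau = \delta^{L-2}\,t$ removes the leading power of $\delta$:}
\frac{d\mathbf{s}}{d\tau}
  &= \nabla\mathcal{N}(\mathbf{s}) + O(\delta^{L}) .
\end{align*}
```

Under these assumptions, a fixed interval $\tau\in[0,T]$ of the rescaled flow corresponds to $t\in[0,T/\delta^{L-2}]$ in the original time, and $\|\mathbf{w}(t)\|_2 = \delta\,\|\mathbf{s}\|_2$ remains of order $\delta$ as long as $\mathbf{s}$ stays bounded on that interval.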

Abstract

This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks assumed to have locally Lipschitz gradients and an order of homogeneity strictly greater than two. It is shown here that for sufficiently small initializations, during the early stages of training, the weights of the neural network remain small in (Euclidean) norm and approximately converge in direction to the Karush-Kuhn-Tucker (KKT) points of the recently introduced neural correlation function. Additionally, this paper also studies the KKT points of the neural correlation function for feed-forward networks with (Leaky) ReLU and polynomial (Leaky) ReLU activations, deriving necessary and sufficient conditions for rank-one KKT points.
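
To make the homogeneity condition concrete, here is a minimal sketch (not from the paper) that checks numerically that a bias-free three-layer Leaky ReLU network is 3-homogeneous in its weights, meaning that scaling all weights by $c>0$ scales the output by $c^3$. The layer sizes, the slope `alpha`, and all helper names are hypothetical choices made for illustration.

```python
import random

def leaky_relu(v, alpha=0.1):
    # Leaky ReLU is positively 1-homogeneous: sigma(c * x) = c * sigma(x) for c > 0.
    return [x if x > 0 else alpha * x for x in v]

def matvec(W, v):
    # Plain matrix-vector product for a list-of-rows matrix.
    return [sum(wij * vj for wij, vj in zip(row, v)) for row in W]

def network(x, weights):
    # Bias-free feed-forward network: W3 * sigma(W2 * sigma(W1 * x)).
    h = x
    for W in weights[:-1]:
        h = leaky_relu(matvec(W, h))
    return matvec(weights[-1], h)

def scale(weights, c):
    # Multiply every weight in every layer by c.
    return [[[c * wij for wij in row] for row in W] for W in weights]

random.seed(0)
dims = [4, 5, 5, 1]  # hypothetical layer widths; three weight matrices, so L = 3
weights = [
    [[random.gauss(0, 1) for _ in range(dims[k])] for _ in range(dims[k + 1])]
    for k in range(len(dims) - 1)
]
x = [random.gauss(0, 1) for _ in range(dims[0])]

c = 0.5
out = network(x, weights)[0]
out_scaled = network(x, scale(weights, c))[0]
print(out_scaled, c ** 3 * out)  # the two values should agree up to rounding
```

The same check fails once biases are added or the activation is not positively homogeneous, which is why the homogeneity assumption restricts the architectures covered.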

Paper Structure (28 sections, 19 theorems, 265 equations, 10 figures, 6 tables)

Key Result

Theorem 1

Let $\mathbf{w}_0$ be a fixed unit vector. Under L_homo_assumption, for any $\epsilon\in (0,\eta/2)$, where $\eta$ is a positive constant, there exist $T_\epsilon$, $\tilde{B}_\epsilon$, and $\overline{\delta}>0$ such that the following holds: for any $\delta\in(0,\overline{\delta})$ and solution [...] Further, for $\overline{T}_\epsilon = T_\epsilon/\delta^{L-2}$, either [...] where $\mathbf{u}_*$ is a [...]
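
The flavor of this statement can be illustrated on a toy problem. The sketch below is an assumption-laden example, not the paper's experiment: it uses a scalar 3-homogeneous model $w_1 w_2 w_3$ fitted to the single target $y=1$ with square loss, integrates gradient flow by forward Euler from $\delta\mathbf{w}_0$ up to time $T/\delta^{L-2}$, and reports $\|\mathbf{w}\|_2/\delta$ together with the cosine between $\mathbf{w}$ and $(1,1,1)/\sqrt{3}$, which is a stationary direction of $w_1 w_2 w_3$ on the unit sphere. The function name `simulate`, the initial direction, and the values of `T` and `n_steps` are hypothetical.

```python
import math

def simulate(delta, T=1.0, n_steps=2000, w0=(3.0, 2.0, 1.0)):
    """Forward-Euler gradient flow for the toy loss 0.5 * (w1*w2*w3 - 1)^2.

    Starts at delta * w0 / ||w0|| and runs until time T / delta^(L-2), L = 3.
    Returns (||w|| / delta, cosine between w and (1, 1, 1) / sqrt(3)).
    """
    L = 3
    y = 1.0
    norm0 = math.sqrt(sum(v * v for v in w0))
    w = [delta * v / norm0 for v in w0]
    dt = T / (delta ** (L - 2)) / n_steps  # step size in the original time t
    for _ in range(n_steps):
        residual = w[0] * w[1] * w[2] - y
        grad = [
            residual * w[1] * w[2],
            residual * w[0] * w[2],
            residual * w[0] * w[1],
        ]
        w = [wi - dt * gi for wi, gi in zip(w, grad)]
    norm = math.sqrt(sum(v * v for v in w))
    cosine = sum(w) / (math.sqrt(3.0) * norm)
    return norm / delta, cosine

# Cosine of the initial direction (3, 2, 1) with (1, 1, 1) / sqrt(3), for comparison.
initial_cosine = 6.0 / (math.sqrt(3.0) * math.sqrt(14.0))

for delta in (1e-2, 1e-3):
    ratio, cosine = simulate(delta)
    print(f"delta={delta:g}  ||w||/delta={ratio:.4f}  cosine={cosine:.4f}")
```

In this toy setting the ratio $\|\mathbf{w}\|_2/\delta$ is nearly identical for both values of $\delta$, and the final cosine is larger than the initial one, consistent with weights that stay $O(\delta)$ while their direction moves toward a stationary direction over a horizon proportional to $1/\delta^{L-2}$. It is only an illustration; it does not verify the theorem's constants or its alternative case in which the weights collapse toward zero.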

Figures (10)

  • Figure 1
  • Figure 2
  • Figure 4: At initialization
  • Figure 5: At iteration 50360
  • Figure 7: ReLU activation
  • ...and 5 more figures

Theorems & Definitions (33)

  • Theorem 1
  • Lemma 2
  • Theorem 3
  • Theorem 4
  • Remark 1
  • Lemma 5
  • Definition A.1
  • Lemma 6
  • Lemma 7
  • Lemma 8
  • ...and 23 more