Early Directional Convergence in Deep Homogeneous Neural Networks for Small Initializations

Akshay Kumar; Jarvis Haupt

TL;DR

The paper addresses the non-linear training dynamics of deep $L$-homogeneous networks with $L>2$ under small initialization by linking early weight-direction convergence to non-negative KKT points of a constrained Neural Correlation Function (NCF). It introduces a time-rescaled gradient-flow analysis showing that, for a sufficiently small initialization scale $\delta$, the weight norms stay $O(\delta)$ and the weight directions either align with KKT points of the constrained NCF or the weights collapse toward zero, within a time horizon that scales as $1/\delta^{L-2}$. It also characterizes rank-one KKT points for feed-forward networks with Leaky ReLU and polynomial Leaky ReLU activations, giving necessary and sufficient conditions; numerical experiments suggest that such rank-one points commonly arise in practice. The results offer insight into the low-rank structure that emerges early in training and lay groundwork for understanding training dynamics in deeper, non-NTK regimes, with ReLU posing a notable open challenge. The work thus connects the theory of deep homogeneous networks in the small-initialization regime with empirical observations of rank-one weight structure, which may inform the study of generalization and initialization strategies.
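The $1/\delta^{L-2}$ time scale can be motivated by a short heuristic calculation. The sketch below is not quoted from the paper: it assumes a square loss, writes $\mathcal{H}(\mathbf{X};\mathbf{w})$ for the vector of network outputs on the training inputs and $\mathbf{y}$ for the labels, and takes the NCF to be $\mathcal{N}(\mathbf{w}) = \mathbf{y}^\top \mathcal{H}(\mathbf{X};\mathbf{w})$. It also assumes the rescaled weights $\mathbf{s}$ stay bounded, so that the remainder terms are valid.

$$
\begin{aligned}
\dot{\mathbf{w}}(t) &= -\nabla_{\mathbf{w}}\,\tfrac{1}{2}\,\|\mathcal{H}(\mathbf{X};\mathbf{w}) - \mathbf{y}\|_2^2
 = \nabla \mathcal{N}(\mathbf{w}) + O\!\left(\|\mathbf{w}\|_2^{2L-1}\right), \\
\mathbf{w}(t) = \delta\,\mathbf{s}(t) \;&\Longrightarrow\; \delta\,\dot{\mathbf{s}}(t) = \delta^{L-1}\,\nabla \mathcal{N}(\mathbf{s}) + O\!\left(\delta^{2L-1}\right)
 \quad \text{since } \nabla\mathcal{N}(\delta \mathbf{s}) = \delta^{L-1}\,\nabla\mathcal{N}(\mathbf{s}), \\
\tau = \delta^{L-2}\, t \;&\Longrightarrow\; \frac{d\mathbf{s}}{d\tau} = \nabla \mathcal{N}(\mathbf{s}) + O\!\left(\delta^{L}\right).
\end{aligned}
$$

Under these assumptions, the rescaled weights approximately follow the gradient flow of the NCF, with no remaining dependence on $\delta$, so a fixed horizon $T$ in the rescaled time $\tau$ corresponds to $t = T/\delta^{L-2}$ in the original time.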

Abstract

This paper studies the gradient flow dynamics that arise when training deep homogeneous neural networks assumed to have locally Lipschitz gradients and an order of homogeneity strictly greater than two. It is shown that, for sufficiently small initializations, during the early stages of training the weights of the neural network remain small in (Euclidean) norm and approximately converge in direction to the Karush-Kuhn-Tucker (KKT) points of the recently introduced neural correlation function. The paper also studies the KKT points of the neural correlation function for feed-forward networks with (Leaky) ReLU and polynomial (Leaky) ReLU activations, deriving necessary and sufficient conditions for rank-one KKT points.
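The early-training behaviour described above can be explored on a toy problem. The following is a minimal sketch, not the authors' experimental setup: it assumes a square loss, a two-neuron network with squared activation (homogeneous of order $L = 3$), a hypothetical two-point dataset, and plain gradient descent as a stand-in for gradient flow. The names `H`, `ncf`, `neg_loss_grad`, and `train` are illustrative only.

```python
import math

# Toy data (an assumption for illustration): two inputs with labels of opposite sign.
X = [(1.0, 0.0), (0.0, 1.0)]
Y = [1.0, -0.5]
L = 3  # order of homogeneity of H below (2 from the hidden layer, 1 from the output layer)


def H(x, W, v):
    """Two-layer net with squared activation: H = sum_j v_j * (w_j . x)^2."""
    return sum(vj * (wj[0] * x[0] + wj[1] * x[1]) ** 2 for wj, vj in zip(W, v))


def ncf(W, v):
    """Assumed form of the neural correlation function: N(w) = sum_i y_i * H(x_i; w)."""
    return sum(y * H(x, W, v) for x, y in zip(X, Y))


def neg_loss_grad(W, v):
    """Negative gradient of 0.5 * sum_i (H(x_i; w) - y_i)^2 with respect to (W, v)."""
    gW = [[0.0, 0.0] for _ in W]
    gv = [0.0 for _ in v]
    for x, y in zip(X, Y):
        r = y - H(x, W, v)  # residual; close to y while the weights are tiny
        for j, (wj, vj) in enumerate(zip(W, v)):
            z = wj[0] * x[0] + wj[1] * x[1]
            gv[j] += r * z * z
            gW[j][0] += r * 2.0 * vj * z * x[0]
            gW[j][1] += r * 2.0 * vj * z * x[1]
    return gW, gv


def norm(W, v):
    """Euclidean norm of all weights stacked into one vector."""
    return math.sqrt(sum(a * a for wj in W for a in wj) + sum(b * b for b in v))


def scaled(W, v, c):
    """Return a copy of (W, v) with every entry multiplied by c."""
    return [[c * a for a in wj] for wj in W], [c * b for b in v]


def train(delta, T=0.4, lr=1.0):
    """Gradient descent from w(0) = delta * w0 up to time T / delta^(L-2).

    Returns (||w_end|| / delta, NCF at the final direction, NCF at the initial direction).
    """
    W0 = [[0.5, 0.5], [-0.3, 0.4]]
    v0 = [0.4, -0.3]  # (W0, v0) has unit Euclidean norm
    W, v = scaled(W0, v0, delta)
    steps = round(T / (delta ** (L - 2) * lr))
    for _ in range(steps):
        gW, gv = neg_loss_grad(W, v)
        W = [[a + lr * g for a, g in zip(wj, gj)] for wj, gj in zip(W, gW)]
        v = [b + lr * g for b, g in zip(v, gv)]
    r = norm(W, v)
    U, u = scaled(W, v, 1.0 / r)  # unit-norm direction of the final weights
    return r / delta, ncf(U, u), ncf(W0, v0)


if __name__ == "__main__":
    for delta in (1e-2, 1e-3):
        ratio, n_end, n_start = train(delta)
        print(f"delta={delta:g}  ||w||/delta={ratio:.3f}  "
              f"NCF(direction): {n_start:.3f} -> {n_end:.3f}")
```

Running the script prints, for each $\delta$, the final norm relative to $\delta$ and the NCF value at the normalized weight direction before and after training. In this toy setting the norm stays within a small constant multiple of $\delta$ while the NCF value of the direction increases, which is the qualitative picture the abstract describes; the horizon `T`, step size `lr`, and data are arbitrary choices, and a longer horizon would be needed to see the direction settle.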

Paper Structure (28 sections, 19 theorems, 265 equations, 10 figures, 6 tables)

Introduction
Related Works
Problem Formulation
Early Directional Convergence
Main Result
Proof Sketch of Theorem 1
Challenges for non-differentiable neural networks
A Closer Look at the Weights
KKT Points of Constrained NCF
Leaky ReLU
Polynomial Leaky ReLU
Numerical Experiments
Proof Overview
Conclusion and Future Directions
Key lemmata
...and 13 more sections

Key Result

Theorem 1

Let $\mathbf{w}_0$ be a fixed unit vector. Under the $L$-homogeneity assumption (L_homo_assumption), for any $\epsilon\in (0,\eta/2)$, where $\eta$ is a positive constant, there exist $T_\epsilon$, $\tilde{B}_\epsilon$, and $\overline{\delta}>0$ such that the following holds: for any $\delta\in(0,\overline{\delta})$ and solution … Further, for $\overline{T}_\epsilon = T_\epsilon/\delta^{L-2}$, either … where $\mathbf{u}_*$ is a n…
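For orientation, the KKT conditions of a constrained NCF can be written out in the differentiable case. The block below is a sketch under assumptions not stated in this summary: the constrained NCF is taken to be the maximization of $\mathcal{N}(\mathbf{u})$ over the unit sphere, $\mathcal{N}$ is assumed differentiable and $L$-homogeneous, and "non-negative" is read as $\mathcal{N}(\mathbf{u}_*) \geq 0$.

$$
\begin{aligned}
&\max_{\mathbf{u}}\ \mathcal{N}(\mathbf{u}) \quad \text{subject to} \quad \|\mathbf{u}\|_2^2 = 1, \\
&\text{KKT point } \mathbf{u}_*: \quad \nabla \mathcal{N}(\mathbf{u}_*) = \lambda\,\mathbf{u}_*, \qquad \|\mathbf{u}_*\|_2 = 1, \\
&\text{Euler's identity } \mathbf{u}^\top \nabla \mathcal{N}(\mathbf{u}) = L\,\mathcal{N}(\mathbf{u}) \;\Longrightarrow\; \lambda = L\,\mathcal{N}(\mathbf{u}_*).
\end{aligned}
$$

Under this reading, a KKT point is non-negative exactly when its multiplier $\lambda$ is non-negative. For non-differentiable activations such as ReLU, the gradient would have to be replaced by a generalized (for example, Clarke) subdifferential, which this sketch does not cover.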

Figures (10)

Figure 1
Figure 2
Figure 4: At initialization
Figure 5: At iteration 50360
Figure 7: ReLU activation
...and 5 more figures

Theorems & Definitions (33)

Theorem 1
Lemma 2
Theorem 3
Theorem 4
Remark 1
Lemma 5
Definition A.1
Lemma 6
Lemma 7
Lemma 8
...and 23 more
