Weak Correlations as the Underlying Principle for Linearization of Gradient-Based Learning Systems

Ori Shem-Ur; Yaron Oz

TL;DR

The paper investigates why gradient-descent learning in over-parameterized networks tends to become linear in its parameter dynamics, especially in the wide (NTK) regime. It formalizes a mechanism in which linearization arises from weak correlations between the first and higher-order derivatives of the hypothesis with respect to the parameters, evaluated at initialization, and it develops a framework for random tensor asymptotics, used together with Tensor Programs, to establish this across architectures. Central results include an equivalence between linearity and weak derivative correlations, and bounds on deviations from linearity under SGD, with wide neural networks serving as a canonical example. The notions introduced along the way (subordinate tensor norms, stochastic big-O bounds, and definite asymptotic bounds) provide a general analytical toolkit for handling random tensors in high-dimensional learning problems, with implications for understanding NTK behavior and the role of external scaling in linearization. Overall, the work offers a principled lens on the prevalence of linearized learning in wide networks and suggests that weak correlations might be exploited as a design principle or regularization mechanism in practice.

Abstract

Deep learning models, such as wide neural networks, can be conceptualized as nonlinear dynamical physical systems characterized by a multitude of interacting degrees of freedom. Such systems, in the infinite limit, tend to exhibit simplified dynamics. This paper delves into gradient descent-based learning algorithms that display a linear structure in their parameter dynamics, reminiscent of the neural tangent kernel. We establish that this apparent linearity arises from weak correlations between the first and higher-order derivatives of the hypothesis function with respect to the parameters, taken around their initial values. This insight suggests that these weak correlations could be the underlying reason for the observed linearization in such systems. As a case in point, we showcase this weak-correlations structure within neural networks in the large-width limit. Exploiting the relationship between linearity and weak correlations, we derive a bound on deviations from linearity observed during the training trajectory of stochastic gradient descent. To facilitate our proof, we introduce a novel method to characterize the asymptotic behavior of random tensors.
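The linearization described in the abstract can be probed numerically. The following is a minimal sketch, not taken from the paper: it assumes a hypothetical two-layer tanh network with a 1/sqrt(n) output scaling and standard Gaussian initialization, takes one gradient-sized step in parameter space, and compares the exact change in the output with its first-order Taylor prediction. For this step the gap is a contraction of the gradient with second and higher derivatives, which is the kind of quantity the weak-correlations picture concerns. All function names and parameter choices here are illustrative assumptions.

```python
import math
import random


def init_params(width, dim, rng):
    """Draw output weights a_i and input weights w_i i.i.d. from N(0, 1)."""
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    w = [[rng.gauss(0.0, 1.0) for _ in range(dim)] for _ in range(width)]
    return a, w


def forward(a, w, x):
    """f(x) = (1 / sqrt(n)) * sum_i a_i * tanh(w_i . x), with n = width."""
    total = 0.0
    for a_i, w_i in zip(a, w):
        z = sum(w_ij * x_j for w_ij, x_j in zip(w_i, x))
        total += a_i * math.tanh(z)
    return total / math.sqrt(len(a))


def gradient(a, w, x):
    """Gradient of f(x) with respect to (a, w), in the same layout."""
    scale = 1.0 / math.sqrt(len(a))
    grad_a, grad_w = [], []
    for a_i, w_i in zip(a, w):
        t = math.tanh(sum(w_ij * x_j for w_ij, x_j in zip(w_i, x)))
        grad_a.append(scale * t)
        # d tanh(z) / dz = 1 - tanh(z)^2
        grad_w.append([scale * a_i * (1.0 - t * t) * x_j for x_j in x])
    return grad_a, grad_w


def linearization_error(width, dim=3, trials=20, seed=0):
    """Mean |exact change - first-order prediction| over random initializations."""
    rng = random.Random(seed)
    x = [1.0 / math.sqrt(dim)] * dim  # fixed unit-norm input
    total = 0.0
    for _ in range(trials):
        a, w = init_params(width, dim, rng)
        g_a, g_w = gradient(a, w, x)
        # Parameter step delta = gradient (learning rate 1, an assumption).
        a_new = [a_i + g for a_i, g in zip(a, g_a)]
        w_new = [[w_ij + g for w_ij, g in zip(w_i, g_i)]
                 for w_i, g_i in zip(w, g_w)]
        # First-order Taylor prediction: grad . delta = |grad|^2.
        first_order = (sum(g * g for g in g_a)
                       + sum(g * g for g_i in g_w for g in g_i))
        exact = forward(a_new, w_new, x) - forward(a, w, x)
        total += abs(exact - first_order)
    return total / trials


if __name__ == "__main__":
    for width in (16, 256, 4096):
        print(width, linearization_error(width))
```

Under these assumptions the printed deviation should shrink as the width grows, consistent with the wide-network case the abstract cites. The sketch checks only a single step at a single input, so it illustrates the phenomenon rather than the paper's bound along an SGD trajectory.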

Paper Structure (47 sections, 20 theorems, 154 equations)

Introduction
Our Contributions
Random Tensor Asymptotic Behavior
The Subordinate Tensor Norm
Effectiveness of the Stochastic "Big O" Notation
The Definite Random Tensor Asymptotic Bound
Weak Correlations and Linearization
Notations for Supervised Learning
General notations
Neural Tangent Kernel Notations
The Derivatives Correlations
The Derivatives Correlations Definition
Equivalence of Linearity and Weak Derivatives Correlations
Our Main Theorems
External Scale and Hessian Spectral Norm
...and 32 more sections

Key Result

Theorem 2.1

Consider a random tensor $M$ with a limiting parameter $n$ as described earlier. There exists $f\in\mathcal{N}$ that serves as a tight (definite) upper bound for $M$. Furthermore, the asymptotic behavior of $f$ is unique.

Theorems & Definitions (60)

Definition 2.1: Asymptotic Upper Bound of Random Tensors
Remark 2.1
Remark 2.2
Theorem 2.1: Definite Asymptotic Bounds for Tensors
proof : Explanation
Remark 3.1
Definition 3.1: Derivatives Correlations
Theorem 3.1: Fixed Weak Correlations and Linearization Equivalence
Theorem 3.2: Exponential Weak Correlations and Linearization Equivalence
proof : Explanation
...and 50 more
