Weak Correlations as the Underlying Principle for Linearization of Gradient-Based Learning Systems

Ori Shem-Ur, Yaron Oz

TL;DR

The paper investigates why gradient-descent learning on over-parameterized networks tends to exhibit linear parameter dynamics, especially in the wide/NTK regime. It formalizes a mechanism in which linearization arises from weak correlations between the first- and higher-order derivatives of the hypothesis with respect to the parameters, evaluated at initialization, and it develops a random-tensor asymptotics framework that, together with Tensor Programs, is used to prove this across architectures. Central results include an equivalence between linearity and weak derivative correlations, and bounds on deviations from linearity under SGD, with wide neural networks serving as the canonical example. The notions introduced (subordinate tensor norms, stochastic big-O bounds, and definite asymptotic bounds) provide a general toolkit for handling random tensors in high-dimensional learning problems, with implications for understanding NTK behavior and the role of external scaling in linearization. Overall, the work offers a principled explanation for the prevalence of linearized learning in wide networks and suggests that weak correlations could be exploited as a design principle or regularization mechanism in practice.
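
As a schematic illustration of why derivative correlations control linearity, consider a scalar hypothesis $F(\theta)$, a single training sample, and a cost $\mathcal{C}(F)$. The notation here is an assumption made for illustration and is not the paper's own definition of derivative correlations. Expanding around the initial parameters $\theta_0$ gives

$$F(\theta_0+\Delta\theta) = F(\theta_0) + \nabla F(\theta_0)^{\top}\Delta\theta + \tfrac{1}{2}\,\Delta\theta^{\top}\nabla^{2}F(\theta_0)\,\Delta\theta + \dots$$

A first gradient step with learning rate $\eta$ moves the parameters along the first derivative, $\Delta\theta = -\eta\,\mathcal{C}'(F(\theta_0))\,\nabla F(\theta_0)$, so the leading nonlinear correction becomes

$$\tfrac{1}{2}\,\eta^{2}\,\mathcal{C}'(F(\theta_0))^{2}\;\nabla F(\theta_0)^{\top}\,\nabla^{2}F(\theta_0)\,\nabla F(\theta_0).$$

This is a contraction of the second derivative with two copies of the first. When contractions of this kind (and their higher-order analogues) are small, the linear term dominates, which is the sense in which weak correlations between first and higher derivatives go together with linearized dynamics.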

Abstract

Deep learning models, such as wide neural networks, can be conceptualized as nonlinear dynamical physical systems characterized by a multitude of interacting degrees of freedom. Such systems, in the infinite limit, tend to exhibit simplified dynamics. This paper delves into gradient descent-based learning algorithms that display a linear structure in their parameter dynamics, reminiscent of the neural tangent kernel. We establish that this apparent linearity arises due to weak correlations between the first and higher-order derivatives of the hypothesis function with respect to the parameters, taken around their initial values. This insight suggests that these weak correlations could be the underlying reason for the observed linearization in such systems. As a case in point, we showcase this weak-correlation structure within neural networks in the large-width limit. Exploiting the relationship between linearity and weak correlations, we derive a bound on deviations from linearity observed during the training trajectory of stochastic gradient descent. To facilitate our proof, we introduce a novel method to characterise the asymptotic behavior of random tensors.
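
The deviation from linearity mentioned above can be probed numerically. The following is a minimal sketch under stated assumptions, not the paper's experiment: it uses a toy two-layer network $f(x) = n^{-1/2}\sum_i a_i \tanh(w_i x)$ with scalar input, standard normal initialization, a single squared-loss gradient step, and hypothetical helper names (`linearization_gap`, `mean_gap`). It compares the trained network's output at a test point with the first-order Taylor prediction around initialization.

```python
import math
import random


def init_params(width, rng):
    """Draw hidden weights w and output weights a i.i.d. from N(0, 1)."""
    w = [rng.gauss(0.0, 1.0) for _ in range(width)]
    a = [rng.gauss(0.0, 1.0) for _ in range(width)]
    return w, a


def forward(w, a, x):
    """Toy network: f(x) = (1/sqrt(n)) * sum_i a_i * tanh(w_i * x)."""
    return sum(ai * math.tanh(wi * x) for wi, ai in zip(w, a)) / math.sqrt(len(w))


def grad(w, a, x):
    """Gradient of f(x) with respect to the parameters (w, a)."""
    scale = 1.0 / math.sqrt(len(w))
    t = [math.tanh(wi * x) for wi in w]
    gw = [scale * ai * x * (1.0 - ti * ti) for ai, ti in zip(a, t)]
    ga = [scale * ti for ti in t]
    return gw, ga


def linearization_gap(width, seed, x_train=1.0, y_train=1.0, x_test=0.5, lr=1.0):
    """|f(x_test) - f_lin(x_test)| after one gradient step on 0.5 * (f - y)^2."""
    rng = random.Random(seed)
    w0, a0 = init_params(width, rng)

    # One gradient-descent step on the single training sample.
    residual = forward(w0, a0, x_train) - y_train
    gw, ga = grad(w0, a0, x_train)
    dw = [-lr * residual * g for g in gw]
    da = [-lr * residual * g for g in ga]
    w1 = [wi + d for wi, d in zip(w0, dw)]
    a1 = [ai + d for ai, d in zip(a0, da)]

    # First-order Taylor prediction around the initial parameters.
    gw_t, ga_t = grad(w0, a0, x_test)
    f_lin = (
        forward(w0, a0, x_test)
        + sum(g * d for g, d in zip(gw_t, dw))
        + sum(g * d for g, d in zip(ga_t, da))
    )
    return abs(forward(w1, a1, x_test) - f_lin)


def mean_gap(width, n_seeds=20):
    """Average the gap over several seeds to smooth out initialization noise."""
    return sum(linearization_gap(width, s) for s in range(n_seeds)) / n_seeds


if __name__ == "__main__":
    for width in (16, 256, 2048):
        print(width, mean_gap(width))
```

In this toy setting the averaged gap should shrink as the width grows, in line with the linearization picture described in the abstract. The sketch does not attempt to reproduce the paper's bound or its rate.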

Paper Structure

This paper contains 47 sections, 20 theorems, and 154 equations.

Key Result

Theorem 2.1

Consider a random tensor $M$ with a limiting parameter $n$, as described earlier. There exists $f\in\mathcal{N}$ that serves as a tight (definite) upper bound for $M$. Furthermore, the asymptotic behavior of $f$ is unique.

Theorems & Definitions (60)

  • Definition 2.1: Asymptotic Upper Bound of Random Tensors
  • Remark 2.1
  • Remark 2.2
  • Theorem 2.1: Definite Asymptotic Bounds for Tensors
  • Proof: Explanation
  • Remark 3.1
  • Definition 3.1: Derivatives Correlations
  • Theorem 3.1: Fixed Weak Correlations and Linearization Equivalence
  • Theorem 3.2: Exponential Weak Correlations and Linearization Equivalence
  • Proof: Explanation
  • ...and 50 more