Derivation of effective gradient flow equations and dynamical truncation of training data in Deep Learning

Thomas Chen

TL;DR

This work derives explicit gradient-flow equations for cumulative weights and biases of a deep ReLU network under Euclidean input-space cost, assuming alignment of weights to the activation. It shows that gradient flow acts as a dynamical truncation of training data in input space, causing data clusters to shrink and, in favorable cases, collapse to points, offering an interpretable view linked to neural collapse. The analysis covers both cluster-separated truncations with explicit ODEs and the general case without separation, and it connects these dynamics to standard cost scenarios where a spectral gap ensures exponential convergence. Overall, the results provide a rigorous, geometry-driven explanation for how training data structures evolve under gradient descent and illuminate interpretability questions in supervised learning.
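The link between a spectral gap and exponential convergence can be illustrated in a generic setting that is not taken from the paper. The sketch below assumes a quadratic cost $C(x)=\tfrac12 x^T A x$ with a diagonal positive-definite matrix $A$, whose gradient flow is $\dot x = -Ax$; the function name `gradient_flow_diag`, the eigenvalues, and the initial point are hypothetical choices for illustration. Under these assumptions the distance to the minimizer is bounded by $e^{-\lambda_{\min} t}$ times its initial value, where $\lambda_{\min}$ is the smallest eigenvalue.

```python
import math

def gradient_flow_diag(eigenvalues, x0, t_final, steps):
    """Forward-Euler integration of dx/dt = -A x with A = diag(eigenvalues).

    This is the gradient flow of the (assumed) cost C(x) = 0.5 * x^T A x.
    """
    h = t_final / steps
    x = list(x0)
    for _ in range(steps):
        # Each coordinate decays independently at its own eigenvalue.
        x = [xi - h * lam * xi for xi, lam in zip(x, eigenvalues)]
    return x

def norm(v):
    return math.sqrt(sum(c * c for c in v))

# Hypothetical spectrum and starting point, chosen only for illustration.
eigenvalues = [0.5, 2.0, 5.0]
x0 = [1.0, -2.0, 3.0]
t_final = 4.0

x_t = gradient_flow_diag(eigenvalues, x0, t_final, steps=4000)

# The spectral gap (smallest eigenvalue) controls the worst-case decay rate.
gap = min(eigenvalues)
bound = math.exp(-gap * t_final) * norm(x0)

print(f"|x(t)| = {norm(x_t):.6f}, bound exp(-gap * t) |x(0)| = {bound:.6f}")
```

The slowest coordinate, the one attached to the smallest eigenvalue, dominates at late times, which is why the gap sets the rate.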

Abstract

We derive explicit equations governing the cumulative biases and weights in Deep Learning with ReLU activation function, based on gradient descent for the Euclidean cost in the input layer, and under the assumption that the weights are, in a precise sense, adapted to the coordinate system distinguished by the activations. We show that gradient descent corresponds to a dynamical process in the input layer, whereby clusters of data are progressively reduced in complexity ("truncated") at an exponential rate that increases with the number of data points that have already been truncated. We provide a detailed discussion of several types of solutions to the gradient flow equations. A main motivation for this work is to shed light on the interpretability question in supervised learning.

Paper Structure (20 sections, 14 theorems, 239 equations)

Introduction
Definition of the mathematical model
Standard cost is pullback cost in input layer
Euclidean cost in input layer
Weights adapted to the activation
Gradient flow in input space for cluster separated truncations
Definitions and notations
Gradient flow for Euclidean cost
Gradient flow and moments of $\mu_\ell$
Geometry of orbits for cluster separated truncations
Equilibria
Flow of $\beta^{(\ell)}$ and $R_\ell$ for partially truncated initial data
Flow of $\beta^{(\ell)}$ at fixed $R_\ell$
General gradient flow without cluster separated truncations
Proof of Theorem \ref{thm-gradflow-input-1-0}
...and 5 more sections

Key Result

Theorem 3.2

Let $\ell\in\{1,\dots,Q\}$, and for notational convenience, with $W^{(Q+1)}$ fixed. Assume that the $\ell$-th truncation map $\tau^{(\ell)}$ acts as the identity on all clusters $\ell'\neq\ell$, so that eq-tauinv-1-0 holds, and that $W^{(\ell)}$ and $\sigma$ are aligned. Then we may assume without loss of generality that […]. Let […] denote the affine map associated to the $\ell$-th hidden layer […]
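To make the notion of a truncation map concrete, here is a minimal sketch in two dimensions. It assumes the form $\tau(x) = W^{-1}(\sigma(Wx+b)-b)$ with $W$ invertible and $\sigma$ the ReLU; this formula is not stated in the summary above and is used here only as a working assumption. The helper names, the matrix `W`, the bias `b`, and the sample clusters are all hypothetical. Under this assumption, a cluster on which every component of $Wx+b$ is negative collapses to the single point $-W^{-1}b$, while a cluster on which every component is positive is left unchanged, matching the theorem's requirement that the map act as the identity on the other clusters.

```python
def relu(v):
    return [max(0.0, c) for c in v]

def matvec(M, v):
    return [sum(m * c for m, c in zip(row, v)) for row in M]

def inverse_2x2(M):
    (a, b), (c, d) = M
    det = a * d - b * c
    if det == 0:
        raise ValueError("W must be invertible")
    return [[d / det, -b / det], [-c / det, a / det]]

def truncation_map(W, b, x):
    """Assumed form (not from the source): tau(x) = W^{-1}(relu(W x + b) - b)."""
    pre = [p + q for p, q in zip(matvec(W, x), b)]
    post = relu(pre)
    return matvec(inverse_2x2(W), [p - q for p, q in zip(post, b)])

# Hypothetical weights, bias, and clusters chosen for illustration.
W = [[2.0, 0.0], [0.0, 0.5]]
b = [-4.0, -1.0]

# Every component of W x + b is negative on this cluster: it is truncated.
cluster_a = [[0.5, 0.5], [1.0, 0.0], [0.0, 1.5]]
# Every component of W x + b is positive on this cluster: it is untouched.
cluster_b = [[3.0, 4.0], [5.0, 2.5]]

collapsed = [truncation_map(W, b, x) for x in cluster_a]
preserved = [truncation_map(W, b, x) for x in cluster_b]

# The collapse point is -W^{-1} b.
collapse_point = matvec(inverse_2x2(W), [-c for c in b])

print("collapsed:", collapsed)
print("preserved:", preserved)
print("collapse point:", collapse_point)
```

Points with mixed signs in $Wx+b$ would be only partially truncated, which corresponds to projecting some coordinates while leaving others intact.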

Theorems & Definitions (29)

Definition 2.1
Definition 2.2
Definition 3.1
Theorem 3.2
Definition 3.3
Definition 3.4: Free and constrained moments of $\mu_\ell$
Corollary 3.5
Proposition 4.1
proof
Proposition 4.2
...and 19 more
