Derivation of effective gradient flow equations and dynamical truncation of training data in Deep Learning

Thomas Chen

TL;DR

This work derives explicit gradient-flow equations for cumulative weights and biases of a deep ReLU network under Euclidean input-space cost, assuming alignment of weights to the activation. It shows that gradient flow acts as a dynamical truncation of training data in input space, causing data clusters to shrink and, in favorable cases, collapse to points, offering an interpretable view linked to neural collapse. The analysis covers both cluster-separated truncations with explicit ODEs and the general case without separation, and it connects these dynamics to standard cost scenarios where a spectral gap ensures exponential convergence. Overall, the results provide a rigorous, geometry-driven explanation for how training data structures evolve under gradient descent and illuminate interpretability questions in supervised learning.

Abstract

We derive explicit equations governing the cumulative biases and weights in Deep Learning with ReLU activation function, based on gradient descent for the Euclidean cost in the input layer, and under the assumption that the weights are, in a precise sense, adapted to the coordinate system distinguished by the activations. We show that gradient descent corresponds to a dynamical process in the input layer, whereby clusters of data are progressively reduced in complexity ("truncated") at an exponential rate that increases with the number of data points that have already been truncated. We provide a detailed discussion of several types of solutions to the gradient flow equations. A main motivation for this work is to shed light on the interpretability question in supervised learning.

Paper Structure

This paper contains 20 sections, 14 theorems, 239 equations.

Key Result

Theorem 3.2

Let $\ell\in\{1,\dots,Q\}$, and for notational convenience, with $W^{(Q+1)}$ fixed. Assume that the $\ell$-th truncation map $\tau^{(\ell)}$ acts as the identity on all clusters $\ell'\neq\ell$ so that eq-tauinv-1-0 holds, and that $W^{(\ell)}$ and $\sigma$ are aligned. Then, we may assume without any loss of generality that Let denote the affine map associated to the $\ell$-th hidden laye
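
The hypothesis that a truncation map acts as the identity on all clusters but one can be illustrated with a small numerical sketch. The form $\tau(x) = W^{-1}(\sigma(Wx+b)-b)$ used below is an assumption (it is not stated in this excerpt), and the identity weights, the bias, and the two synthetic clusters are hypothetical choices made only for illustration. They are not the paper's construction, nor its precise notion of alignment.

```python
import numpy as np


def relu(z):
    """Componentwise ReLU activation."""
    return np.maximum(z, 0.0)


def truncation_map(points, weights, bias):
    """Apply the ASSUMED map tau(x) = W^{-1}(relu(W x + b) - b) to each row.

    `points` has shape (n, d); `weights` is an invertible (d, d) matrix.
    """
    pre_activation = points @ weights.T + bias
    return np.linalg.solve(weights, (relu(pre_activation) - bias).T).T


# Two hypothetical clusters in the plane, built from bounded offsets so the
# sign of every pre-activation coordinate is known in advance.
rng = np.random.default_rng(0)
offsets = rng.uniform(-0.5, 0.5, size=(50, 2))
cluster_a = np.array([-3.0, -3.0]) + offsets  # W x + b < 0 in both coordinates
cluster_b = np.array([3.0, 3.0]) + offsets    # W x + b > 0 in both coordinates

# Illustrative parameters only: identity weights and a constant bias.
weights = np.eye(2)
bias = np.array([1.0, 1.0])

truncated_a = truncation_map(cluster_a, weights, bias)
truncated_b = truncation_map(cluster_b, weights, bias)

print("spread of cluster A before:", cluster_a.std(axis=0))
print("spread of cluster A after: ", truncated_a.std(axis=0))
print("cluster B unchanged:", np.allclose(truncated_b, cluster_b))
```

In this toy setting every point of cluster A has negative pre-activations, so the ReLU zeroes them out and the whole cluster is sent to the single point $-W^{-1}b$, while cluster B lies in the region where the ReLU is the identity and is left untouched.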

Theorems & Definitions (29)

  • Definition 2.1
  • Definition 2.2
  • Definition 3.1
  • Theorem 3.2
  • Definition 3.3
  • Definition 3.4: Free and constrained moments of $\mu_\ell$
  • Corollary 3.5
  • Proposition 4.1
  • proof
  • Proposition 4.2
  • ...and 19 more