Gradient Flow Through Diagram Expansions: Learning Regimes and Explicit Solutions

Dmitry Yarotsky, Eugene Golikov, Yaroslav Gusev

TL;DR

The paper develops a unifying framework for analyzing gradient flow (GF) in large CP-decomposition models by writing the loss evolution as a formal power series in time whose coefficients are encoded by diagrams. From the large-size limit of this expansion, the authors derive a taxonomy based on Pareto polygons that links the scaling of the input size $p$, hidden size $H$, and initialization variance $\sigma^2$ to distinct learning regimes (free evolution, NTK, and under- and overparameterized mean-field). In many regimes the formal series can be summed by reducing it to a PDE solved by the method of characteristics, which yields explicit loss dynamics and phase diagrams that agree with experiments on identity targets and modular arithmetic tasks. The work gives insight into when feature learning occurs, clarifies the role of symmetry, and offers a toolkit that may generalize to other nonlinear gradient flows in high-dimensional learning systems.

Abstract

We develop a general mathematical framework to analyze scaling regimes and derive explicit analytic solutions for gradient flow (GF) in large learning problems. Our key innovation is a formal power series expansion of the loss evolution, with coefficients encoded by diagrams akin to Feynman diagrams. We show that this expansion has a well-defined large-size limit that can be used to reveal different learning phases and, in some cases, to obtain explicit solutions of the nonlinear GF. We focus on learning Canonical Polyadic (CP) decompositions of high-order tensors, and show that this model has several distinct extreme lazy and rich GF regimes such as free evolution, NTK, and under- and over-parameterized mean-field. We show that these regimes depend on the parameter scaling, tensor order, and symmetry of the model in a specific and subtle way. Moreover, we propose a general approach to summing the formal loss expansion by reducing it to a PDE; in a wide range of scenarios, this PDE turns out to be first-order and solvable by the method of characteristics. We observe very good agreement between our theoretical predictions and experiments.
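To make the setting concrete, here is a minimal NumPy sketch of gradient flow on an order-3 CP model, discretized with explicit Euler steps. Everything specific in it is an assumption made for illustration rather than the paper's exact setup: the asymmetric three-factor parametrization, the identity target taken as $F_{ijk}=\delta_{i=j=k}$, the loss $\frac12\|f-F\|^2$ without any size-dependent normalization, and the hypothetical function names `cp_tensor`, `loss_and_grads`, and `run_gradient_flow`.

```python
import numpy as np


def cp_tensor(U1, U2, U3):
    """Order-3 CP model: f[i,j,k] = sum_h U1[i,h] * U2[j,h] * U3[k,h]."""
    return np.einsum("ih,jh,kh->ijk", U1, U2, U3)


def loss_and_grads(U1, U2, U3, F):
    """Squared loss 0.5 * ||f - F||^2 and its gradients w.r.t. each factor."""
    R = cp_tensor(U1, U2, U3) - F  # residual tensor
    loss = 0.5 * np.sum(R ** 2)
    g1 = np.einsum("ijk,jh,kh->ih", R, U2, U3)
    g2 = np.einsum("ijk,ih,kh->jh", R, U1, U3)
    g3 = np.einsum("ijk,ih,jh->kh", R, U1, U2)
    return loss, (g1, g2, g3)


def run_gradient_flow(p=6, H=12, sigma=0.3, dt=1e-2, steps=300, seed=0):
    """Euler discretization of gradient flow; returns the loss trajectory.

    Assumptions (illustrative only): i.i.d. N(0, sigma^2) initialization
    and an identity target F[i,i,i] = 1, zero elsewhere.
    """
    rng = np.random.default_rng(seed)
    U = [sigma * rng.standard_normal((p, H)) for _ in range(3)]
    F = np.zeros((p, p, p))
    idx = np.arange(p)
    F[idx, idx, idx] = 1.0

    losses = []
    for _ in range(steps):
        loss, grads = loss_and_grads(*U, F)
        losses.append(loss)
        # One explicit Euler step of dU/dt = -grad L.
        U = [u - dt * g for u, g in zip(U, grads)]
    losses.append(loss_and_grads(*U, F)[0])
    return np.array(losses)


if __name__ == "__main__":
    trajectory = run_gradient_flow()
    print(trajectory[0], trajectory[-1])
```

Varying `p`, `H`, and `sigma` in such a simulation is one way to probe numerically how the loss curve changes between scalings; the small step size `dt` stands in for continuous time and would need to shrink as the parameters grow.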

Paper Structure

This paper contains 112 sections, 8 theorems, 181 equations, 11 figures, 2 tables.

Key Result

Theorem 3.1

Suppose that the target tensor $F_{i_1,\ldots,i_\nu}$ can be written as a polynomial in $H,p$, indices $i_1,\ldots,i_\nu$ and Kronecker deltas $\delta_{i_a=i_b}$ for $a,b\in\overline{1,\nu}$. Then, for any $s$, $T^s \mathbb E[d^s L/dt^s(0)]$ is a polynomial in $H,p,\sigma^2$.
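
For orientation, the formal expansion behind this statement can be written schematically as follows. This is only the generic Taylor form of a power series in time, under the assumption that $T$ is a time scale; its definition and the paper's exact normalization are not given in this excerpt.

```latex
\mathbb{E}[L(t)] \;\sim\; \sum_{s=0}^{\infty} \frac{t^s}{s!}\,
  \mathbb{E}\!\left[\frac{d^s L}{dt^s}(0)\right]
\;=\; \sum_{s=0}^{\infty} \frac{(t/T)^s}{s!}\,
  \underbrace{T^s\,\mathbb{E}\!\left[\frac{d^s L}{dt^s}(0)\right]}_{\text{polynomial in } H,\,p,\,\sigma^2}
```

Read this way, the theorem says that every coefficient of the series in the rescaled time $t/T$ is a polynomial in $H$, $p$, and $\sigma^2$, which is what makes it possible to compare the sizes of individual terms as these quantities grow.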

Figures (11)

  • Figure 1: Top row: Diagrams in ASYM with $\nu=3$ (three colors). Yellow squares: $H$-nodes; cyan circles: $p$-nodes. a) $D_{6}$; b) $R_3$ with a general target hyper-edge $F$; c) $R_3$ for the identity target \ref{eq:idtarget}; diagrams appearing d) in $D_6\star D_6$ (up to recoloring); e) in $D_6\star R_3$; f) in $R_3\star R_3$. Bottom row: generic diagrams in different regimes. g) a diagram from $D_{6}^{\star s}$ (free evolution); h) a "flower" with one $H$-node (contracted, underparameterized); i) an optimally contracted free underparameterized; j) circular ($\nu=2$); k) circular optimally contracted to a tree (SYM, $\nu=2$); l) generic optimally contracted in SYM, $\nu=4$.
  • Figure 2: Pareto optimal terms (black, see Theorem \ref{th:pareto_front}) and the corresponding Pareto polygons (red) for target \ref{eq:idtarget}. The extremal points present in SYM but missing from ASYM are colored gray.
  • Figure 3: Free ASYM $\nu=3$ (Sec. \ref{sec:free-evo}). Left: theory for the overparameterized case verified experimentally with $p=32$, $H=p^3=32768$. Right: theory for the underparameterized case verified experimentally with $p=H=128$.
  • Figure 4: Experimental confirmation of theoretical predictions (eq. \ref{eq:loss-narayana}) for the loss evolution in SYM $\nu=2$. Experiments were run with fixed $p=512$ and varying $H=1024,512,256$.
  • Figure 5: SYM $\nu=4$. Left: experimental confirmation of the boundary \ref{eq:nu4-threshold}. Gradient ascent divergence rates were obtained in 10 independent experiments for different values of $\rho=p^3\sigma^4$, $\theta = 1+3H/p^2$ with a fixed $p=16$. Right: typical examples of experimental curves together with theoretical predictions (eq. \ref{eq:nu4-loss}) for both high-noise and low-noise scenarios.
  • ...and 6 more figures

Theorems & Definitions (12)

  • Theorem 3.1: \ref{app:diagcalc}
  • Theorem 4.1: \ref{sec:th:pareto_front:proof}
  • Theorem 6.1
  • Proposition 8.1
  • Proposition 8.2
  • Proof of Theorem \ref{th:recurrence}
  • Definition 8.1
  • Proposition 8.2
  • Proposition 8.3
  • Conjecture 8.4
  • ...and 2 more