From Lazy to Rich: Exact Learning Dynamics in Deep Linear Networks

Clémentine C. J. Dominé; Nicolas Anguita; Alexandra M. Proca; Lukas Braun; Daniel Kunin; Pedro A. M. Mediano; Andrew M. Saxe

TL;DR

The paper gives an exact analytical treatment of learning dynamics in deep linear networks under $\lambda$-balanced initializations, revealing a controllable spectrum between rich feature learning and lazy, kernel-like behavior. By deriving closed-form solutions for the gradient flow of key statistics and for the finite-width NTK, it shows how the relative scaling of layers shapes internal representations, the evolution of singular values, and the NTK, with implications for continual, reversal, and transfer learning as well as fine-tuning. The results clarify how initialization interacts with architecture to determine whether representations become task-specific or task-agnostic, and they offer practical guidance for choosing initializations that favor a desired learning regime. The work also connects to neuroscience by relating these regimes to representation learning.

Abstract

Biological and artificial neural networks develop internal representations that enable them to perform complex tasks. In artificial networks, the effectiveness of these models relies on their ability to build task-specific representations, a process influenced by interactions among datasets, architectures, initialization strategies, and optimization algorithms. Prior studies highlight that different initializations can place networks in either a lazy regime, where representations remain static, or a rich (feature learning) regime, where representations evolve dynamically. Here, we examine how initialization influences learning dynamics in deep linear neural networks, deriving exact solutions for $\lambda$-balanced initializations, which are defined by the relative scale of weights across layers. These solutions capture the evolution of representations and the Neural Tangent Kernel across the spectrum from the rich to the lazy regimes. Our findings deepen the theoretical understanding of the impact of weight initialization on learning regimes, with implications for continual learning, reversal learning, and transfer learning, relevant to both neuroscience and practical applications.

Paper Structure (47 sections, 18 theorems, 174 equations, 14 figures, 3 tables, 1 algorithm)

Introduction
Related Work
Preliminaries
Exact Learning Dynamics
Rich and Lazy Learning
Applications
Discussion
Preliminaries
Appendix: Balanced Condition
Discussion Assumptions
Whitened Inputs.
Dimension.
Full rank
Balancedness Assumption
Random weight initialisations and $\lambda$-Balanced Property
...and 32 more sections

Key Result

Lemma 4.1

Under the whitened-inputs and $\lambda$-balanced assumptions, the gradient flow dynamics of $\mathbf{Q}\mathbf{Q}^T(t)$, with initialization $\mathbf{Q}\mathbf{Q}^T(0) = \mathbf{Q}(0)\mathbf{Q}(0)^T$, can be written as a differential matrix Riccati equation.
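The $\lambda$-balanced assumption used by the lemma can be illustrated numerically. The sketch below is not the paper's code. It assumes a two-layer linear network $\mathbf{W}_2\mathbf{W}_1$ with whitened inputs, so the squared-error loss reduces (up to a constant) to $\frac{1}{2}\|\boldsymbol{\Sigma}^{yx} - \mathbf{W}_2\mathbf{W}_1\|_F^2$, and it checks that small-step gradient descent keeps $\mathbf{W}_2^T\mathbf{W}_2 - \mathbf{W}_1\mathbf{W}_1^T$ close to $\lambda\mathbf{I}$, as gradient flow does exactly. The 2x2 initialization, the target matrix `sigma_yx`, and all function names are hypothetical choices made for illustration.

```python
import math

def matmul(A, B):
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)] for row in A]

def transpose(A):
    return [list(col) for col in zip(*A)]

def sub(A, B):
    return [[a - b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def add_scaled(A, B, s):
    """Return A + s * B."""
    return [[a + s * b for a, b in zip(ra, rb)] for ra, rb in zip(A, B)]

def balance_matrix(W1, W2):
    """W2^T W2 - W1 W1^T, which equals lambda * I for a lambda-balanced network."""
    return sub(matmul(transpose(W2), W2), matmul(W1, transpose(W1)))

def balance_drift(W1, W2, lam):
    """Largest absolute entry of (W2^T W2 - W1 W1^T) - lambda * I."""
    B = balance_matrix(W1, W2)
    return max(abs(B[i][j] - (lam if i == j else 0.0))
               for i in range(len(B)) for j in range(len(B)))

def lambda_balanced_init(lam, small=0.1, angle=0.7):
    """Hypothetical 2x2 construction: W1 = a*I and W2 = b*R (R a rotation),
    so that W2^T W2 - W1 W1^T = (b^2 - a^2) * I = lam * I."""
    a2, b2 = (small**2, small**2 + lam) if lam >= 0 else (small**2 - lam, small**2)
    a, b = math.sqrt(a2), math.sqrt(b2)
    c, s = math.cos(angle), math.sin(angle)
    return [[a, 0.0], [0.0, a]], [[b * c, -b * s], [b * s, b * c]]

def train(W1, W2, sigma_yx, lr=1e-3, steps=5000):
    """Plain gradient descent on 0.5 * ||sigma_yx - W2 W1||_F^2."""
    for _ in range(steps):
        err = sub(sigma_yx, matmul(W2, W1))
        step1 = matmul(transpose(W2), err)   # negative gradient w.r.t. W1
        step2 = matmul(err, transpose(W1))   # negative gradient w.r.t. W2
        W1, W2 = add_scaled(W1, step1, lr), add_scaled(W2, step2, lr)
    return W1, W2

if __name__ == "__main__":
    sigma_yx = [[2.0, 0.5], [0.5, 1.0]]      # illustrative input-output correlation
    for lam in (-2.0, 0.0, 2.0):
        W1, W2 = train(*lambda_balanced_init(lam), sigma_yx)
        print(lam, balance_drift(W1, W2, lam))
```

With a discrete step size the balance is only approximately conserved; the drift shrinks as `lr` is reduced, which is consistent with exact conservation in the gradient-flow limit.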

Figures (14)

Figure 1: A minimal model of the rich and lazy regimes. A. We examine a deep and wide linear network trained using gradient descent starting from an initialization characterized by a relative scale parameter $\lambda$, which captures the difference in weight covariance between the first and second layers. B. Network output for an example task over training time, starting from a range of relative scale values. The dynamics depend on the initialization. Solid lines represent simulations, while dotted lines indicate the analytical solutions derived in this work. C. A network with LeCun weight initialization (LeCun et al., 1998) becomes $\lambda$-balanced in the infinite-width limit, as ${\bf W}_2^T{\bf W}_2 - {\bf W}_1{\bf W}_1^T$ approaches a scaled identity matrix.
Figure 2: A. The temporal dynamics of the numerical simulation (colored lines) of the loss, network function, correlation of input and output weights, and the NTK (rows 1-5, respectively) are exactly matched by the analytical solution (black dotted lines) for $\lambda = -2$. B. The same comparison for $\lambda = 0$ with large initial weight values. C. The same comparison for $\lambda = 2$. Initial weights are set as described in the simulation-details appendix.
Figure 3: Simulated and analytical dynamics of the singular values of the network function with relative scale of A. $\lambda = -2$, B. $\lambda = 0$, or C. $\lambda = 2$, initialized as described in the simulation-details appendix.
Figure 4: A. A semantic learning task with the SVD of the input-output correlation matrix of the task. (top) $U$ and $V$ represent the singular vectors, and $S$ contains the singular values. (bottom) The respective representational similarity matrices (RSMs), $USU^\top$ for the input and $VSV^\top$ for the output of the task. B. Simulation results and C. theoretical input and output representation matrices after training, showing convergence when initialized with values of $\lambda$ equal to $-2$, $0$, and $2$, according to the initialization scheme described in the simulation-details appendix. D. Final RSMs after training has converged when initialized from random large weights. E. After convergence, the network's sensitivity to input noise (top panel) is invariant to $\lambda$, but the sensitivity to parameter noise increases as $\lambda$ becomes smaller (or larger) than zero.
Figure 5: A. Schematic representations of the network architectures considered, from left to right: funnel network, square network, and inverted-funnel network. B. The NTK distance from initialization, as defined in (Fort et al., 2020), across the three architectures depicted schematically. C. The NTK distance from initialization over training time.
...and 9 more figures
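The claim in Figure 1C can be probed numerically. The sketch below is an illustration under assumptions that are not taken from the paper: weights are drawn with LeCun-style variance 1/fan_in, $\mathbf{W}_1$ has shape hidden x input and $\mathbf{W}_2$ has shape output x hidden, and the limit is taken by growing the input and output dimensions while the hidden dimension stays fixed (the hidden x hidden matrix ${\bf W}_2^T{\bf W}_2 - {\bf W}_1{\bf W}_1^T$ has rank at most input + output, so it cannot approach a nonzero multiple of the identity if only the hidden layer grows). Closeness to a scaled identity is measured relative to the estimated scale. The function names are hypothetical.

```python
import random

def lecun_balance(n_in, n_hidden, n_out, seed=0):
    """Return D = W2^T W2 - W1 W1^T (n_hidden x n_hidden) for Gaussian weights
    with variance 1/fan_in (assumed LeCun-style scaling)."""
    rng = random.Random(seed)
    # W1: n_hidden x n_in, fan_in = n_in
    W1 = [[rng.gauss(0.0, n_in ** -0.5) for _ in range(n_in)]
          for _ in range(n_hidden)]
    # W2: n_out x n_hidden, fan_in = n_hidden
    W2 = [[rng.gauss(0.0, n_hidden ** -0.5) for _ in range(n_hidden)]
          for _ in range(n_out)]
    return [[sum(row[i] * row[j] for row in W2)
             - sum(a * b for a, b in zip(W1[i], W1[j]))
             for j in range(n_hidden)]
            for i in range(n_hidden)]

def relative_deviation(D):
    """Estimate lambda as the mean diagonal entry of D and return
    (lambda_hat, max |D - lambda_hat * I| / |lambda_hat|).
    Only meaningful when lambda_hat is not close to zero."""
    n = len(D)
    lam_hat = sum(D[i][i] for i in range(n)) / n
    dev = max(abs(D[i][j] - (lam_hat if i == j else 0.0))
              for i in range(n) for j in range(n))
    return lam_hat, dev / abs(lam_hat)

if __name__ == "__main__":
    for n in (50, 500, 4000):
        lam_hat, rel = relative_deviation(lecun_balance(n, 3, n))
        print(n, lam_hat, rel)
```

Under these assumptions the expected value of the difference is $(N_o/N_h - 1)\,\mathbf{I}$, and the relative deviation should shrink roughly like $1/\sqrt{N}$ as the input and output dimensions grow together.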

Theorems & Definitions (32)

Lemma 4.1
Lemma 4.2
Theorem 4.3
Theorem 5.1
Theorem 5.2
Definition A.1: Definition of $\lambda$-balanced property (Saxe et al., 2014; Marcotte et al., 2023)
Theorem A.2
Proof
Theorem A.3
Proof of the theorem on random $\lambda$-balanced initializations
...and 22 more
