Position: Solve Layerwise Linear Models First to Understand Neural Dynamical Phenomena (Neural Collapse, Emergence, Lazy/Rich Regime, and Grokking)

Yoonsoo Nam, Seok Hyeong Lee, Clementine C J Domine, Yeachan Park, Charles London, Wonyl Choi, Niclas Goring, Seungjai Lee

TL;DR

This work argues that solving layerwise linear models under the dynamical feedback principle can illuminate core neural dynamics such as neural collapse, emergence, lazy/rich regimes, and grokking. By focusing on solvable, multilayer linear architectures, the authors derive sigmoidal and stage-like learning, connect these dynamics to empirical DNN phenomena, and show how layer imbalance and weight-to-target ratios control learning regimes. The contributions include formalizing the dynamical feedback principle, analyzing toy models, and interpreting phenomena like NC and grokking within a unified layerwise framework. The approach offers a principled, analytic lens to understand DNN behavior, with potential to guide initialization, architecture design, and training strategies toward more interpretable and generalizable models.

Abstract

In physics, complex systems are often simplified into minimal, solvable models that retain only the core principles. In machine learning, layerwise linear models (e.g., linear neural networks) act as simplified representations of neural network dynamics. These models follow the dynamical feedback principle, which describes how layers mutually govern and amplify each other's evolution. This principle extends beyond the simplified models, successfully explaining a wide range of dynamical phenomena in deep neural networks, including neural collapse, emergence, lazy and rich regimes, and grokking. In this position paper, we call for the use of layerwise linear models retaining the core principles of neural dynamical phenomena to accelerate the science of deep learning.

Paper Structure

This paper contains 83 sections, 2 theorems, 86 equations, 10 figures.

Key Result

Theorem 2

Under the assumptions of whitened inputs (1), $\lambda$-balanced weights (2), no bottleneck, and a task-aligned initialization, as defined in saxe2013exact, the network function is given by $W_2W_1(t) = \tilde{U}P(t)\tilde{V}^T$, where $P(t) \in \mathbb{R}^{c \times c}$ is a di… where $\tilde{\rho}_\alpha$ is the $\alpha$-th singular value of the correlation matrix and $\gamma_\a…

Figures (10)

  • Figure 1: Paper outline. For a systematic presentation of various works on layerwise linear models, we begin the section by building intuition on how the key principle (green) behaves under a condition (yellow). We then formalize this intuition as a key property (blue) using a solvable layerwise linear model. Finally, we discuss how this property from the layerwise linear model extends to describe an empirical phenomenon in DNNs (red).
  • Figure 2: Layerwise linear models. Layerwise linear models (\ref{app:functionspace}) include, but are not limited to, (a) diagonal linear neural networks (\ref{eq:lin_mullin}) and (b) linear neural networks (\ref{eq:linear_nn}). The layerwise structure leads to dynamics distinct from those of linear models (\ref{eq:lin}).
  • Figure 3: Dynamics of the linear model and the diagonal linear neural network. The colored lines show the saturation curves of modes with different variances for (a) the linear model (\ref{eq:dynamics_linear}) and (b) the diagonal linear neural network (\ref{eq:dynamics}) with $S_i=1$. For the linear model, all $\theta_i$ begin saturating at $t=0$ and differ only in saturation speed. For the layerwise model, the products $a_ib_i$ show delayed saturation that depends on $\mathbf{E}[x^2_i]$, so the modes are learned in sequence.
  • Figure 4: Predicting emergence with the layerwise linear model (Figure 1 from nam2024exactly). The skill strength $\mathcal{R}_k =\mathbf{E}[f(x)g_k(x)]/\mathbf{E}[(g_k(x))^2]$ measures the linear correlation between the $k^{\mathrm{th}}$ skill function and the learned function, i.e., how well the skill is learned. Each color represents a different skill. The solid lines show the theoretical layerwise model (\ref{eq:skill_function}) calibrated on the first skill, while the dashed lines represent the empirical results of a 2-layer neural network trained on the multitask sparse parity problem.
  • Figure 5: Illustration of neural collapse. In NC, the last-layer feature vectors (the post-activations of the penultimate layer), illustrated as colored dots, cluster around their class mean vectors, illustrated as colored arrows, and form a simplex ETF structure (orthogonal vectors projected onto the complement of the global mean vector).
  • ...and 5 more figures
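
The contrast described in the Figure 3 caption can be reproduced with a small simulation. The sketch below is not the paper's code. It assumes uncorrelated input features with variances $\mathbf{E}[x_i^2]$, a squared loss, target coefficients $S_i = 1$, and a small balanced initialization $a_i = b_i$ for the diagonal linear neural network. Under those assumptions, gradient flow gives $\dot{\theta}_i = -\mathbf{E}[x_i^2](\theta_i - S_i)$ for the linear model and $\dot{a}_i = -\mathbf{E}[x_i^2]\, b_i (a_i b_i - S_i)$, $\dot{b}_i = -\mathbf{E}[x_i^2]\, a_i (a_i b_i - S_i)$ for the layerwise model. The function names, step size, and initialization scale are illustrative choices.

```python
import math


def half_saturation_times(variances, target=1.0, init=1e-3, dt=1e-3, t_max=60.0):
    """Euler-integrate both models and return, per mode, the first time
    the learned coefficient reaches target / 2.

    Assumes uncorrelated features, squared loss, and a_i = b_i = init.
    """
    n = len(variances)
    theta = [0.0] * n   # linear model: f(x) = sum_i theta_i x_i
    a = [init] * n      # diagonal linear network: f(x) = sum_i a_i b_i x_i
    b = [init] * n
    t_lin = [None] * n
    t_diag = [None] * n
    for step in range(1, int(t_max / dt) + 1):
        t = step * dt
        for i, lam in enumerate(variances):
            # Linear model: the gradient does not depend on other parameters.
            theta[i] -= dt * lam * (theta[i] - target)
            # Layerwise model: each layer's gradient is scaled by the other layer.
            err = a[i] * b[i] - target
            da = -lam * b[i] * err
            db = -lam * a[i] * err
            a[i] += dt * da
            b[i] += dt * db
            if t_lin[i] is None and theta[i] >= target / 2:
                t_lin[i] = t
            if t_diag[i] is None and a[i] * b[i] >= target / 2:
                t_diag[i] = t
    return t_lin, t_diag


def closed_form_linear(lam):
    """Half-saturation time of theta(t) = S (1 - exp(-lam t))."""
    return math.log(2.0) / lam


def closed_form_diag(lam, target=1.0, init=1e-3):
    """Half-saturation time of the logistic solution for p = a b with a = b."""
    p0 = init * init
    return math.log(target / p0 - 1.0) / (2.0 * lam * target)


if __name__ == "__main__":
    variances = [1.0, 0.5, 0.25]
    t_lin, t_diag = half_saturation_times(variances)
    for lam, tl, td in zip(variances, t_lin, t_diag):
        print(f"E[x^2]={lam}: linear t_half={tl:.2f}, diagonal t_half={td:.2f}")
```

With $a_i = b_i$, the product $p_i = a_i b_i$ obeys the logistic equation $\dot{p}_i = 2\,\mathbf{E}[x_i^2]\, p_i (S_i - p_i)$, which is where the sigmoidal curve and the delay come from: the half-saturation time grows with $\ln(S_i/p_i(0))$, whereas the linear model's half-saturation time $\ln 2 / \mathbf{E}[x_i^2]$ has no dependence on the initialization scale.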

Theorems & Definitions (6)

  • Definition 1: Definition of $\lambda$-balanced Property (saxe2013exact, marcotte_abide)
  • Theorem 2
  • Theorem 3
  • proof
  • Remark 4
  • Remark 5
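
The $\lambda$-balanced property in Definition 1 is not spelled out in this summary. A common form in the linear-network literature, assumed here, is $W_2^T W_2 - W_1 W_1^T = \lambda I$; for a two-layer linear network trained by gradient flow on a squared loss with whitened inputs, the matrix $W_2^T W_2 - W_1 W_1^T$ is conserved. The sketch below checks this numerically with small-step gradient descent on a $2 \times 2$ example. The weights, target, learning rate, and helper names are illustrative, and discrete steps conserve the quantity only up to an error that shrinks with the learning rate.

```python
def matmul(A, B):
    """Plain-Python matrix product of two nested-list matrices."""
    return [[sum(A[i][k] * B[k][j] for k in range(len(B)))
             for j in range(len(B[0]))] for i in range(len(A))]


def transpose(A):
    return [list(row) for row in zip(*A)]


def sub(A, B):
    return [[x - y for x, y in zip(ra, rb)] for ra, rb in zip(A, B)]


def balance(W1, W2):
    """Return W2^T W2 - W1 W1^T (assumed form of the balancedness matrix)."""
    return sub(matmul(transpose(W2), W2), matmul(W1, transpose(W1)))


def frobenius(A):
    return sum(x * x for row in A for x in row) ** 0.5


def train(W1, W2, target, lr=1e-3, steps=5000):
    """Gradient descent on 0.5 * ||W2 W1 - target||_F^2 (whitened inputs)."""
    for _ in range(steps):
        err = sub(matmul(W2, W1), target)
        g1 = matmul(transpose(W2), err)   # gradient with respect to W1
        g2 = matmul(err, transpose(W1))   # gradient with respect to W2
        W1 = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W1, g1)]
        W2 = [[w - lr * g for w, g in zip(rw, rg)] for rw, rg in zip(W2, g2)]
    return W1, W2


if __name__ == "__main__":
    W1 = [[0.5, 0.1], [-0.2, 0.3]]
    W2 = [[0.4, -0.1], [0.2, 0.6]]
    target = [[2.0, 0.0], [0.0, 1.0]]
    before = balance(W1, W2)
    W1_new, W2_new = train(W1, W2, target)
    after = balance(W1_new, W2_new)
    print("drift in balancedness:", frobenius(sub(after, before)))
    print("fit error:", frobenius(sub(matmul(W2_new, W1_new), target)))
```

The example starts from a generic initialization rather than an exactly $\lambda$-balanced one, since conservation holds for any starting point; an initialization with $W_2^T W_2 - W_1 W_1^T = \lambda I$ would therefore stay approximately $\lambda$-balanced throughout training.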