Alternating Gradient Flows: A Theory of Feature Learning in Two-layer Neural Networks

Daniel Kunin, Giovanni Luca Marchetti, Feng Chen, Dhruva Karkada, James B. Simon, Michael R. DeWeese, Surya Ganguli, Nina Miolane

TL;DR

The paper introduces Alternating Gradient Flows (AGF), a two-step framework that models feature learning in two-layer networks trained from small initialization. AGF alternates between maximizing a utility function over dormant neurons and minimizing a cost function over active ones, and it predicts the order, timing, and magnitude of the loss drops that accompany each newly learned feature. The framework unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, and it is consistent with known greedy low-rank learning dynamics. In diagonal linear networks, AGF provably converges to gradient flow in the limit of vanishing initialization. Applied to quadratic networks trained on modular addition, AGF gives a complete characterization of the training dynamics, in which Fourier features are learned in decreasing order of coefficient magnitude. The results suggest a link between optimization dynamics and mechanistic interpretability in simple two-layer models; deeper architectures and more general data regimes are left to future work.

Abstract

What features neural networks learn, and how, remains an open question. In this paper, we introduce Alternating Gradient Flows (AGF), an algorithmic framework that describes the dynamics of feature learning in two-layer networks trained from small initialization. Prior works have shown that gradient flow in this regime exhibits a staircase-like loss curve, alternating between plateaus where neurons slowly align to useful directions and sharp drops where neurons rapidly grow in norm. AGF approximates this behavior as an alternating two-step process: maximizing a utility function over dormant neurons and minimizing a cost function over active ones. AGF begins with all neurons dormant. At each iteration, a dormant neuron activates, triggering the acquisition of a feature and a drop in the loss. AGF quantifies the order, timing, and magnitude of these drops, matching experiments across several commonly studied architectures. We show that AGF unifies and extends existing saddle-to-saddle analyses in fully connected linear networks and attention-only linear transformers, where the learned features are singular modes and principal components, respectively. In diagonal linear networks, we prove AGF converges to gradient flow in the limit of vanishing initialization. Applying AGF to quadratic networks trained to perform modular addition, we give the first complete characterization of the training dynamics, revealing that networks learn Fourier features in decreasing order of coefficient magnitude. Altogether, AGF offers a promising step towards understanding feature learning in neural networks.
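The staircase-like loss curve described above can be reproduced in a small toy experiment. The sketch below is plain gradient descent, not AGF itself, on a diagonal linear network $f(x) = \sum_i u_i v_i x_i$. It assumes identity input covariance, so the loss reduces to $\tfrac{1}{2}\sum_i (u_i v_i - \beta^*_i)^2$ and the coordinates decouple; the paper's setting is more general. The target `beta_star`, the initialization scale `alpha`, the step size `lr`, and the function name `train_diagonal_network` are illustrative choices, not taken from the paper.

```python
def train_diagonal_network(beta_star, alpha=1e-3, lr=1e-2, steps=4000):
    """Gradient descent on L(u, v) = 0.5 * sum_i (u_i * v_i - beta_star_i)^2.

    Assumes identity input covariance (a simplification), so each
    coordinate evolves independently of the others.
    """
    d = len(beta_star)
    u = [alpha] * d  # small, balanced initialization
    v = [alpha] * d
    losses = []
    half_times = [None] * d  # first step at which u_i * v_i reaches beta_star_i / 2
    for step in range(steps):
        residual = [u[i] * v[i] - beta_star[i] for i in range(d)]
        losses.append(0.5 * sum(r * r for r in residual))
        for i in range(d):
            grad_u = residual[i] * v[i]
            grad_v = residual[i] * u[i]
            u[i] -= lr * grad_u
            v[i] -= lr * grad_v
            if half_times[i] is None and u[i] * v[i] >= 0.5 * beta_star[i]:
                half_times[i] = step
    return losses, half_times


beta_star = [4.0, 2.0, 1.0]
losses, half_times = train_diagonal_network(beta_star)
print(half_times)   # steps at which each coordinate is half learned
print(losses[-1])   # final loss
```

With these settings the loss stays near a plateau until a coordinate escapes the origin, and coordinates with larger targets escape first. Shrinking `alpha` lengthens each plateau without changing the order.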

Paper Structure

This paper contains 60 sections, 21 theorems, 97 equations, 12 figures, 3 algorithms.

Key Result

Theorem 3.1

Let $(\beta_{\text{AGF}}, t_{\text{AGF}})$ and $(\beta_{\text{PF}}, t_{\text{PF}})$ be the sequences produced by AGF and Algorithm 1 of \cite{pesme2023saddle}, respectively. Then $(\beta_{\text{AGF}}, t_{\text{AGF}}) \to (\beta_{\text{PF}}, t_{\text{PF}})$ pointwise as the initialization scale $\alpha \to 0$.
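
One way to build intuition for the limit $\alpha \to 0$ is a decoupled toy case. It is an assumption made here for illustration, not the setting of the theorem: a single coordinate $\beta = uv$ with balanced initialization $u = v = \alpha$, target $\beta^* > 0$, and loss $\tfrac{1}{2}(uv - \beta^*)^2$. Gradient flow then reduces to a logistic equation, and the time $t_{1/2}$ at which $\beta$ reaches $\beta^*/2$ has a closed form:

```latex
\begin{align*}
\dot{\beta} &= 2\beta\,(\beta^* - \beta), \qquad \beta(0) = \alpha^2, \\
\beta(t) &= \frac{\beta^*}{1 + \left(\beta^*/\alpha^2 - 1\right) e^{-2\beta^* t}}, \\
t_{1/2} &= \frac{1}{2\beta^*}\log\!\left(\frac{\beta^*}{\alpha^2} - 1\right)
\quad\Longrightarrow\quad
\frac{t_{1/2}}{-\log \alpha} \to \frac{1}{\beta^*} \ \text{ as } \alpha \to 0 .
\end{align*}
```

In this toy case the plateau length grows like $\log(1/\alpha)$ while the transition itself takes $O(1)$ time. In time rescaled by $-\log \alpha$, the trajectory therefore approaches a step function that jumps at $1/\beta^*$, which suggests why a sequence of jump values and jump times is a natural object to compare in the limit.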

Figures (12)

  • Figure 1: A unified theory of feature learning in two-layer networks. Left: Alternating Gradient Flows (AGF) models feature learning as a two-step process alternating between utility maximization (blue plateaus) and cost minimization (red drops), where each drop reflects learning a new feature (see \ref{sec:framework}). Middle: AGF unifies prior analyses of saddle-to-saddle dynamics (see \ref{sec:diagnn}, \ref{sec:unifying-analysis}). Right: AGF enables new analysis of empirical phenomena (see \ref{sec:modular-addition}).
  • Figure 3: AGF $=$ GF as $\alpha \to 0$ in diagonal linear networks. Training loss curves for a diagonal linear network under the setup described in \ref{sec:diagnn} for various initialization values $\alpha$. As $\alpha \to 0$, the trajectory predicted by AGF and the empirics of gradient flow converge. To ensure a meaningful comparison between experiments we set $\eta = -\log(\alpha)$.
  • Figure 4: Stepwise singular value decomposition. Training a two-layer fully connected linear network on Gaussian inputs with a power-law covariance $\Sigma_{xx}$ and labels $y(x) = Bx$ generated from a random $B$. We show the dynamics of the singular values of the network's map $AW$ when $\Sigma_{xx}$ commutes with $\Sigma_{yx}^\intercal \Sigma_{yx}$ (a) and when it does not (b). \ref{conj:fully-connected-linear} (black dashed lines) predicts the dynamics well.
  • Figure 5: Stepwise principal component regression. Training a linear transformer to learn linear regression in context. We show the evolution of singular values of $\sum_{i=1}^H V_i K_i Q_i^\intercal$. Horizontal lines show theoretical $A_k$ and vertical dashed lines show lower bounds for the jump time from \ref{eq:in-context-learning-sequence} with $l = k-1$. Dashed black lines are numerical AGF predictions.
  • Figure 6: Stepwise Fourier decomposition. We train a two-layer quadratic network on a modular addition task with $p = 20$, using a template vector $x \in \mathbb{R}^p$ composed of three cosine waves: $\hat{x}[1] = 10$, $\hat{x}[3] = 5$, and $\hat{x}[5] = 2.5$. (a) Output power spectrum over time. The network learns the task by sequentially decomposing $x$ into its Fourier components, acquiring dominant frequencies first. Colored solid lines are gradient descent, black dashed line is AGF run numerically from the same initialization. (b) Model outputs on selected inputs at four training steps, showing progressively accurate reconstructions of the template. (c) Output weight vector $w_i$ for all $H = 18$ neurons and (d) their frequency spectra and dominant phase. Neurons are color-coded by dominant frequency. As predicted by the theory, the neurons group by frequency, while distributing their phase shifts.
  • ...and 7 more figures
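
The template described in the Figure 6 caption can be made concrete with a short sketch. The code below builds a template from three cosine waves with $p = 20$ and ranks its frequencies by Fourier coefficient magnitude, which is the order in which the abstract says the features are learned. The DFT normalization (the `2/p` scaling), the zero phases, and the names `template`, `dft_magnitude`, and `learning_order` are assumptions for illustration, and the paper's convention for $\hat{x}$ may differ. No network is trained here.

```python
import cmath
import math

p = 20  # modulus, as in the Figure 6 caption
coeffs = {1: 10.0, 3: 5.0, 5: 2.5}  # frequency -> Fourier coefficient magnitude

# Assumed normalization: scaling by 2/p makes the unnormalized DFT below
# return exactly the values in `coeffs` at frequencies 1, 3, and 5.
template = [
    (2.0 / p) * sum(c * math.cos(2 * math.pi * k * n / p) for k, c in coeffs.items())
    for n in range(p)
]


def dft_magnitude(signal, k):
    """Magnitude of the k-th unnormalized DFT coefficient of `signal`."""
    n_points = len(signal)
    return abs(sum(s * cmath.exp(-2j * math.pi * k * n / n_points)
                   for n, s in enumerate(signal)))


# Only frequencies 1..p/2 are needed because the template is real-valued.
spectrum = {k: dft_magnitude(template, k) for k in range(1, p // 2 + 1)}

# Rank the non-negligible frequencies by decreasing coefficient magnitude.
learning_order = [k for k in sorted(spectrum, key=spectrum.get, reverse=True)
                  if spectrum[k] > 1e-9]
print(learning_order)  # [1, 3, 5]
```

Under these assumptions the ranking is frequency 1, then 3, then 5, in line with the caption's statement that dominant frequencies are acquired first.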

Theorems & Definitions (40)

  • Theorem 3.1
  • Conjecture 4.1
  • Theorem 5.1
  • Theorem 5.2
  • Lemma B.1
  • proof
  • Theorem B.2
  • proof
  • Lemma C.1
  • proof
  • ...and 30 more