Three Mechanisms of Feature Learning in a Linear Network

Yizhou Xu, Liu Ziyin

TL;DR

The paper characterizes both kernel and feature-learning dynamics in finite-width neural networks by exactly solving a minimal two-layer linear model with 1D data. The analysis reduces the gradient-flow dynamics to a one-dimensional system for β=1, yielding explicit expressions and a phase diagram that separates the kernel and feature-learning regimes across initializations and hyperparameters. The authors identify three feature-learning mechanisms (alignment, disalignment, and output rescaling) that arise only in the feature-learning regime, and validate them empirically on deeper nonlinear networks. The work offers guidance on choosing initialization scales and learning rates to steer training toward the feature-learning regime, and it connects finite-width and infinite-width analyses.

Abstract

Understanding the dynamics of neural networks in different width regimes is crucial for improving their training and performance. We present an exact solution for the learning dynamics of a one-hidden-layer linear network, with one-dimensional data, across any finite width, uniquely exhibiting both kernel and feature learning phases. This study marks a technical advancement by enabling the analysis of the training trajectory from any initialization and a detailed phase diagram under varying common hyperparameters such as width, layer-wise learning rates, and scales of output and initialization. We identify three novel prototype mechanisms specific to the feature learning regime: (1) learning by alignment, (2) learning by disalignment, and (3) learning by rescaling, which contrast starkly with the dynamics observed in the kernel regime. Our theoretical findings are substantiated with empirical evidence showing that these mechanisms also manifest in deep nonlinear networks handling real-world tasks, enhancing our understanding of neural network training dynamics and guiding the design of more effective learning strategies.
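
As a rough illustration of the setting described in the abstract, the sketch below runs plain gradient descent (a small-step stand-in for gradient flow) on a one-hidden-layer linear network with a single one-dimensional data direction. It assumes a squared loss `0.5 * (u·w - y)^2`, where `w` stands for the first-layer weights applied to the unit input direction; this parameterization, the function names, and the hyperparameter values are illustrative assumptions, not the paper's exact model. The cosine between `u` and `w` is used as an assumed proxy for layer alignment and is not necessarily the paper's $\zeta$.

```python
import numpy as np

def cosine(p, q):
    """Cosine similarity between two vectors."""
    return float(p @ q / (np.linalg.norm(p) * np.linalg.norm(q)))

def train_alignment(d=20, sigma=0.1, y=1.0, lr=0.01, steps=5000, seed=0):
    """Gradient descent on 0.5 * (u.w - y)^2 from a small random init.

    u : second-layer weights, shape (d,)
    w : first-layer weights applied to the unit input direction, shape (d,)
    Returns (initial cosine, final cosine, final loss).
    """
    rng = np.random.default_rng(seed)
    u = sigma * rng.standard_normal(d)
    w = sigma * rng.standard_normal(d)
    cos_start = cosine(u, w)
    for _ in range(steps):
        resid = u @ w - y
        # Simultaneous update: each layer's gradient is the residual
        # times the other layer's weights.
        u, w = u - lr * resid * w, w - lr * resid * u
    loss = 0.5 * (u @ w - y) ** 2
    return cos_start, cosine(u, w), float(loss)

if __name__ == "__main__":
    c0, c1, loss = train_alignment()
    print(f"cosine(u, w): start={c0:.3f}, end={c1:.3f}, final loss={loss:.2e}")
```

In this toy setup the sum `u + w` grows and the difference `u - w` shrinks while the output is below the target, so the two layers rotate toward each other as the loss falls. This is meant only as intuition for an alignment-type effect under the stated assumptions.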

Paper Structure (24 sections, 7 theorems, 60 equations, 10 figures, 2 tables)

This paper contains 24 sections, 7 theorems, 60 equations, 10 figures, 2 tables.

Key Result

Proposition 1

Let $\tilde{x}=an$, where $a\in \mathbb{R}$ is a random variable and $n$ is a fixed unit vector. Let $x=\sqrt{\mathbb{E}[\tilde{x}^2]}n$ and $y=\frac{\mathbb{E}[\tilde{x}y(\tilde{x})]}{\sqrt{\mathbb{E}[\tilde{x}^2]}}$. Then, the gradient flow of Eq. (\ref{eq:original_loss}) equals the gradient flow of
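
The reduction in Proposition 1 can be checked numerically. The loss referenced by the proposition is not reproduced in this summary, so the sketch below assumes a mean squared error for a linear model $f(\tilde{x}) = u^\top W \tilde{x}$, and reads the expectations as $\mathbb{E}[a^2]$ and $\mathbb{E}[a\,y]$ for the scalar coefficient $a$. Under those assumptions, the gradients computed on the full dataset should match the gradients computed on the single effective pair $(x, y)$. All function names are hypothetical.

```python
import numpy as np

def dataset_grads(u, W, a, y, n):
    """Gradients of mean_i (u^T W (a_i n) - y_i)^2 with respect to u and W."""
    s = u @ W @ n                               # response to the unit direction n
    dL_ds = 2.0 * np.mean((a * s - y) * a)
    return dL_ds * (W @ n), dL_ds * np.outer(u, n)

def effective_grads(u, W, a, y, n):
    """Gradients of (u^T W x - y_eff)^2 on the single effective data point."""
    scale = np.sqrt(np.mean(a ** 2))            # sqrt(E[a^2])
    x_eff = scale * n
    y_eff = np.mean(a * y) / scale              # E[a y] / sqrt(E[a^2])
    resid = u @ W @ x_eff - y_eff
    return 2.0 * resid * (W @ x_eff), 2.0 * resid * np.outer(u, x_eff)

def check_equivalence(d=8, d_in=5, num_samples=1000, seed=0):
    """Return the largest absolute gradient mismatch between the two losses."""
    rng = np.random.default_rng(seed)
    n = rng.standard_normal(d_in)
    n /= np.linalg.norm(n)                      # fixed unit vector
    a = rng.standard_normal(num_samples)        # random scalar coefficients
    y = np.sin(a) + 0.1 * rng.standard_normal(num_samples)  # arbitrary labels
    u = rng.standard_normal(d)
    W = rng.standard_normal((d, d_in))
    gu_data, gW_data = dataset_grads(u, W, a, y, n)
    gu_eff, gW_eff = effective_grads(u, W, a, y, n)
    return max(np.max(np.abs(gu_data - gu_eff)),
               np.max(np.abs(gW_data - gW_eff)))

if __name__ == "__main__":
    print("max gradient mismatch:", check_equivalence())
```

The two losses differ only by a constant that does not depend on the weights, which is why their gradients, and hence their gradient flows, coincide under the assumed loss.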

Figures (10)

  • Figure 1: The evolution of $\zeta$ of two-layer networks with different settings. Specifically, we test linear, ReLU, sigmoid, swish, and leaky ReLU activations for both alignment (upper) and disalignment (lower) cases. For the linear network, we show the theoretical predictions obtained from Eq. (\ref{eq:angle}) as lines and experimental results as points. The results for nonlinear networks are qualitatively similar.
  • Figure 2: The alignment angle $\zeta$ between different layers of a four-layer FCN with ReLU activation trained on MNIST. (b) shows the final alignment for different initialization scale $\sigma$, while (a) shows training curves corresponding to $\sigma=1$. The dashed lines in (b) show the initial alignment. See Appendix \ref{app sec: exp} for experiments on a six-layer network.
  • Figure 3: The initialization scale $\sigma$ correlates negatively with the performance of ResNet-18 on the CIFAR-10 dataset. Left: test accuracy. Here, $\sigma$ is a constant multiplier we apply to the initialized weights of the model under the Kaiming init. Right: the norm of all weights. While all models achieve 100% training accuracy, models initialized with a larger scale converge to solutions with higher weight norms, which is a sign that the layers are misaligned.
  • Figure 4: A two-layer fully connected ReLU net with $d$ neurons trained on the CIFAR-10 dataset for $10000$ epochs with batch size $128$. The kernel phase is shown in solid lines and the feature learning phase is shown in dashed lines. As the theory predicts, both types of initialization can be turned into either the feature learning or the kernel phase by choosing different combinations of $\gamma$ and $\eta$. Left: the best test accuracy during training. Right: relative distance from the initialization.
  • Figure 5: A two-layer FCN with different initialization scales trained on the CIFAR-10 dataset. We see that finite-width models can also exhibit qualitative differences between the feature learning and the kernel phases when other hyperparameters are scaled toward infinity. Notably, this scaling is different from the lazy training scaling, implying that there are numerous (actually infinitely many) ways for the model to enter the kernel phase, even at a finite width.
  • ...and 5 more figures
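
The "relative distance from the initialization" reported in Figure 4 can be reproduced in miniature. The sketch below is a toy stand-in, not the CIFAR-10 experiment: it trains the same assumed one-dimensional linear model `f = u·w` with loss `0.5 * (f - y)^2` at two initialization scales and reports how far the weights move relative to their starting norm. The step-size rule and the scales `0.05` and `2.0` are illustrative choices.

```python
import numpy as np

def relative_distance(sigma, d=100, y=1.0, steps=20000, seed=0):
    """Train f = u.w toward y; return (||theta_T - theta_0|| / ||theta_0||, final loss)."""
    rng = np.random.default_rng(seed)
    u0 = sigma * rng.standard_normal(d)
    w0 = sigma * rng.standard_normal(d)
    # Step size scaled by a curvature estimate so that both small and
    # large initializations stay numerically stable (an illustrative choice).
    lr = 0.1 / (u0 @ u0 + w0 @ w0 + 2.0 * abs(y))
    u, w = u0.copy(), w0.copy()
    for _ in range(steps):
        resid = u @ w - y
        u, w = u - lr * resid * w, w - lr * resid * u
    moved = np.sqrt(np.sum((u - u0) ** 2) + np.sum((w - w0) ** 2))
    start = np.sqrt(u0 @ u0 + w0 @ w0)
    return float(moved / start), float(0.5 * (u @ w - y) ** 2)

if __name__ == "__main__":
    for sigma in (0.05, 2.0):
        rel, loss = relative_distance(sigma)
        print(f"sigma={sigma}: relative distance={rel:.3f}, final loss={loss:.2e}")
```

In this toy model the small initialization has to travel far from its starting point to fit the target, while the large initialization fits it with a comparatively small relative change. That matches the qualitative picture of weights staying close to initialization in the kernel phase, under the assumptions stated above.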

Theorems & Definitions (12)

  • Proposition 1
  • Theorem 1
  • Remark
  • Definition 1
  • Theorem 2
  • Corollary 1
  • Corollary 2
  • Corollary 3
  • proof
  • proof
  • ...and 2 more