Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty

Thomas George, Guillaume Lajoie, Aristide Baratin

TL;DR

The paper compares lazy (kernel-like) training with nonlinear feature learning in deep networks by introducing a tunable scaling parameter $\alpha$ that interpolates between the two regimes. It combines empirical studies on toy data, CIFAR10, and spurious-correlation tasks with a tractable quadratic model to show that nonlinear training prioritizes easy examples early on, yielding faster learning for those groups while often delaying harder or noisier ones. The theoretical analysis reveals that linearization preserves per-mode convergence times while nonlinear dynamics induce sequential learning across modes, aligning with a simplicity bias and curriculum-like learning. These findings illuminate why deep networks can generalize well and suggest designing training schedules that leverage the natural ordering of example difficulty. The work thus provides a nuanced understanding of how representation learning interacts with data difficulty beyond kernel-based explanations.

Abstract

Among attempts at giving a theoretical account of the success of deep neural networks, a recent line of work has identified a so-called lazy training regime in which the network can be well approximated by its linearization around initialization. Here we investigate the comparative effect of the lazy (linear) and feature learning (non-linear) regimes on subgroups of examples based on their difficulty. Specifically, we show that easier examples are given more weight in feature learning mode, resulting in faster training compared to more difficult ones. In other words, the non-linear dynamics tends to sequentialize the learning of examples of increasing difficulty. We illustrate this phenomenon across different ways to quantify example difficulty, including c-score, label noise, and in the presence of easy-to-learn spurious correlations. Our results reveal a new understanding of how deep networks prioritize resources across example difficulty.
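The lazy and feature learning regimes contrasted above can be illustrated with a small numerical sketch. It assumes the output-rescaling construction from the lazy training literature, where the model $\alpha\,(f_\theta - f_{\theta_0})$ is trained on a loss divided by $\alpha^2$; this summary only states that $\alpha=1$ is standard training and $\alpha=100$ is linearized, so the exact scaling is an assumption here. The two-layer tanh network, the toy data, and the names `train_scaled` and `make_toy_data` are hypothetical and are not the authors' setup.

```python
import numpy as np

def init_params(rng, d_in=2, width=64):
    # Standard-normal weights; the 1/sqrt(width) factor is applied in forward().
    return rng.normal(size=(d_in, width)), rng.normal(size=width)

def forward(params, X):
    W1, w2 = params
    H = np.tanh(X @ W1)                      # hidden activations, shape (n, width)
    return H @ w2 / np.sqrt(len(w2)), H

def train_scaled(alpha, X, y, steps=200, lr=0.1, seed=0):
    """Gradient descent on (1 / (2 alpha^2)) * mean((alpha * (f - f0) - y)^2).

    Returns the final value of 0.5 * mean((alpha * (f - f0) - y)^2), which is
    comparable across alpha, and the relative distance travelled by the weights.
    """
    rng = np.random.default_rng(seed)
    W1_0, w2_0 = init_params(rng, d_in=X.shape[1])
    W1, w2 = W1_0.copy(), w2_0.copy()
    f0, _ = forward((W1_0, w2_0), X)         # output at initialization, subtracted below
    n, s = len(y), np.sqrt(len(w2_0))
    for _ in range(steps):
        f, H = forward((W1, w2), X)
        residual = alpha * (f - f0) - y
        g_f = residual / (alpha * n)         # derivative of the scaled loss w.r.t. f
        g_w2 = H.T @ g_f / s
        g_H = np.outer(g_f, w2) / s
        g_W1 = X.T @ (g_H * (1.0 - H ** 2))  # tanh'(z) = 1 - tanh(z)^2
        W1 -= lr * g_W1
        w2 -= lr * g_w2
    f, _ = forward((W1, w2), X)
    final_loss = 0.5 * np.mean((alpha * (f - f0) - y) ** 2)
    moved = np.sqrt(np.sum((W1 - W1_0) ** 2) + np.sum((w2 - w2_0) ** 2))
    scale = np.sqrt(np.sum(W1_0 ** 2) + np.sum(w2_0 ** 2))
    return final_loss, moved / scale

def make_toy_data(n=32, seed=1):
    # Hypothetical 2D binary task with labels in {-1, +1}.
    rng = np.random.default_rng(seed)
    X = rng.uniform(-1.0, 1.0, size=(n, 2))
    y = np.sign(X[:, 0] + 0.5 * X[:, 1])
    return X, y

X, y = make_toy_data()
for alpha in (1.0, 100.0):
    loss, rel_move = train_scaled(alpha, X, y)
    print(f"alpha={alpha:g}  loss={loss:.4f}  relative weight change={rel_move:.5f}")
```

Under this scaling the function-space update is independent of $\alpha$ to first order, while the weights move roughly $1/\alpha$ as far, which is why a large $\alpha$ keeps the network close to its linearization around initialization.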

Paper Structure (41 sections, 1 theorem, 28 equations, 14 figures)

Key Result

Proposition 1

The solution of (eq:GDflownonlin, eq:GDtheta_f) is given by [...]. By contrast, the solution in the linearized regime, where $\bm \Sigma(t) \approx \bm \Sigma(0)$, is [...].

Figures (14)

  • Figure 1: 100 randomly initialized runs of a 4-layer MLP trained on the yin-yang dataset (a) using gradient descent in both the non-linear ($\alpha=1$) and linearized ($\alpha=100$) settings. The training losses (b) show a speed-up in the non-linear regime: in order to compare both regimes at equal progress, we normalize by comparing models extracted at equal training loss thresholds (c), (d) and (e). We visualize the differences $\Delta\text{loss}\left(x_{\text{test}}\right)=\text{loss}f_{\text{non-linear}}\left(x_{\text{test}}\right)-\text{loss}f_{\text{linear}}\left(x_{\text{test}}\right)$ for test points paving the 2D square $\left[-1,1\right]^{2}$ using a color scale. We observe that these differences are not uniformly spread across examples: instead they suggest a comparative bias of the non-linear regime towards correctly classifying easy examples (large areas of the same class), whereas difficult examples (e.g. the small disks) are boosted in the linear regime.
  • Figure 2: Starting from the same initial parameters, we train 2 ResNet18 models with $\alpha=1$ (standard training) and $\alpha=100$ (linearized training) on CIFAR10 using SGD with momentum. (Top left) We compute the training loss separately on 10 subgroups of examples ranked by their C-scores. Training progress is normalized by the mean training loss on the $x$-axis. Unsurprisingly, in both regimes examples with high C-scores are learned faster. Remarkably, this ranking is more pronounced in the non-linear regime, as can be observed by comparing dashed and solid lines of the same color. (Bottom left) We randomly flip the class of 15% of the training examples. At equal progress (measured by equal loss on clean examples), the non-linear regime prioritizes learning clean examples and nearly ignores noisy examples compared to the linear regime, since the solid curve remains higher for the non-linear regime. Concomitantly, the non-linear test loss reaches a lower value. (Right) On the same training run, as a sanity check we observe that the $\alpha=100$ training run remains in the linear regime throughout, since all metrics stay close to $1$, whereas in the $\alpha=1$ run the NTK and representation kernel rotate and a large fraction of ReLU signs flip. These experiments are complemented in Appendix \ref{Appsec:add_exp} with accuracy plots for the same experiments, and with other experiments with varying initial model parameters and mini-batch order.
  • Figure 3: We visualize the trajectories of training runs on 2 spurious-correlation setups by computing the accuracy on 2 separate subsets: one with examples that contain the spurious feature (with spurious), the other one without spurious correlations (w/o spurious). On CelebA (top row), the attribute 'blond' is spuriously correlated with the gender 'woman'. In the first phase of training we observe that (left) the test accuracy is essentially higher for the linear run, which can be further explained by observing that (middle) the training accuracy for w/o spurious examples increases faster in the linear regime than in non-linear regimes at equal with spurious training accuracy. (right) A similar trend holds for test examples. In this first phase the linear regime is less sensitive to the spurious correlation (easy examples) and is thus more robust. (bottom row) On Waterbirds, the background (e.g. a lake) is spuriously correlated with the label (e.g. a water bird). (left) We observe the same hierarchy between the linear run and other runs. In the first training phase, the linear regime is less prone to learning the spurious correlation: the w/o spurious accuracy stays higher while the with spurious examples are learned ((middle) and (right)). These experiments are complemented in fig. \ref{fig:celeba_runs} in Appendix \ref{Appsec:add_exp} with varying initial model parameters and mini-batch order.
  • Figure 4: (left) Different input/label correlation (example 1): examples are learned in a flipped order in the two regimes. (middle) Label noise (example 2): the non-linear dynamics prioritizes learning the clean labels. (right) Spurious correlations (example 3): the non-linear dynamics prioritizes learning the spuriously correlated feature. These analytical curves are complemented with numerical experiments on a standard (dense) 2-layer MLP in figure \ref{fig:analytical_examples_numerical} in appendix \ref{Appsec:analytical_numerical}, which show a similar qualitative behaviour.
  • Figure 5: Same as fig. \ref{fig:cifar_cscores_noisy}, with $\alpha=1$ and varying learning rates in $\{0.01,0.003,10^{-4},10^{-6}\}$. In this experiment, we rule out the role of the learning rate in the learning speed of easy/difficult examples, since regardless of the learning rate, all runs follow the same trajectory as measured by the accuracy on noisy examples and on test examples during training. This shows that modulating the learning rate plays a different role from the $\alpha$ scaling.
  • ...and 9 more figures
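The captions above rely on three bookkeeping steps: flipping the class of a fraction of training labels, averaging the loss over subgroups ranked by a difficulty score such as the C-score, and comparing runs at equal training-loss thresholds. A minimal sketch of these steps follows. All function names are hypothetical, the choice to always flip to a different class is an assumption (the captions only say labels are randomly flipped), and this is not the authors' code.

```python
import numpy as np

def flip_labels(labels, num_classes, frac, rng):
    """Reassign a fraction `frac` of labels to a different class.

    Returns (noisy_labels, is_noisy), where is_noisy marks the flipped examples.
    """
    labels = np.asarray(labels)
    n = len(labels)
    idx = rng.choice(n, size=int(round(frac * n)), replace=False)
    noisy = labels.copy()
    # An offset in [1, num_classes - 1] modulo num_classes never maps a label to itself.
    offsets = rng.integers(1, num_classes, size=len(idx))
    noisy[idx] = (labels[idx] + offsets) % num_classes
    is_noisy = np.zeros(n, dtype=bool)
    is_noisy[idx] = True
    return noisy, is_noisy

def loss_by_score_group(per_example_loss, scores, num_groups=10):
    """Mean loss in `num_groups` near-equal-size groups, ordered from lowest to highest score."""
    per_example_loss = np.asarray(per_example_loss)
    order = np.argsort(scores)
    return np.array([per_example_loss[g].mean() for g in np.array_split(order, num_groups)])

def first_step_at_or_below(mean_loss_curve, threshold):
    """Index of the first recorded step whose mean training loss is <= threshold, else None."""
    hits = np.flatnonzero(np.asarray(mean_loss_curve) <= threshold)
    return int(hits[0]) if hits.size else None

# Usage with synthetic stand-ins for real labels, scores and loss curves.
rng = np.random.default_rng(0)
labels = rng.integers(0, 10, size=1000)
noisy_labels, is_noisy = flip_labels(labels, num_classes=10, frac=0.15, rng=rng)
print("flipped examples:", int(is_noisy.sum()))
```

To compare two regimes at equal progress, one would call `first_step_at_or_below` on each run's mean-loss curve with the same threshold and then evaluate the per-group losses at the two resulting steps, rather than at the same iteration count.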

Theorems & Definitions (1)

  • Proposition 1