Lazy vs hasty: linearization in deep networks impacts learning schedule based on example difficulty
Thomas George, Guillaume Lajoie, Aristide Baratin
TL;DR
The paper compares lazy (kernel-like) training with nonlinear feature learning in deep networks by introducing a tunable scaling parameter alpha that interpolates between the two regimes. It combines empirical studies on toy data, CIFAR-10, and spurious-correlation tasks with a tractable quadratic model to show that nonlinear training prioritizes easy examples early on, so those groups are learned faster while harder or noisier ones are often delayed. The theoretical analysis indicates that linearization preserves per-mode convergence times, whereas nonlinear dynamics induce sequential learning across modes, consistent with a simplicity bias and curriculum-like learning. These findings help explain why deep networks can generalize well and suggest designing training schedules that exploit the natural ordering of example difficulty. The work thus refines the picture of how representation learning interacts with data difficulty beyond kernel-based explanations.
Abstract
Among attempts at giving a theoretical account of the success of deep neural networks, a recent line of work has identified a so-called lazy training regime in which the network can be well approximated by its linearization around initialization. Here we investigate the comparative effect of the lazy (linear) and feature learning (non-linear) regimes on subgroups of examples based on their difficulty. Specifically, we show that easier examples are given more weight in feature learning mode, resulting in faster training compared to more difficult ones. In other words, the non-linear dynamics tend to sequentialize the learning of examples of increasing difficulty. We illustrate this phenomenon across different ways to quantify example difficulty, including c-score, label noise, and the presence of easy-to-learn spurious correlations. Our results reveal a new understanding of how deep networks prioritize resources across example difficulty.
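
To make the role of the tunable alpha more concrete, here is a minimal sketch of one common way to interpolate between the two regimes. It is an assumption, not a construction taken from this summary: the centered network output is multiplied by alpha and the squared loss is divided by alpha squared. The two-layer tanh network, the toy regression data, and the names `forward`, `train_scaled`, and `make_toy_data` are all hypothetical choices made for illustration. Under this scaling, a large alpha should leave the parameters close to initialization (lazy, near-linear training), while alpha near 1 lets them move and learn features.

```python
import numpy as np


def forward(W, a, X):
    """Two-layer tanh network: f(x) = sum_j a_j * tanh(w_j . x)."""
    return np.tanh(X @ W.T) @ a


def train_scaled(alpha, X, y, width=64, steps=500, lr=0.01, seed=0):
    """Gradient descent on the scaled model alpha * (f_theta - f_theta0).

    Assumed loss: 0.5 * mean(residual**2) / alpha**2. With this choice the
    speed in function space is comparable across alpha, while the distance
    travelled in parameter space shrinks as alpha grows.
    Returns (relative parameter movement, unscaled final loss).
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    W0 = rng.normal(size=(width, d)) / np.sqrt(d)
    a0 = rng.normal(size=width) / np.sqrt(width)
    W, a = W0.copy(), a0.copy()
    f0 = forward(W0, a0, X)  # output at initialization, subtracted off

    for _ in range(steps):
        H = np.tanh(X @ W.T)                 # hidden activations, shape (n, width)
        residual = alpha * (H @ a - f0) - y  # scaled output minus target
        g = residual / (alpha * n)           # d loss / d f_theta(x_i)
        grad_a = H.T @ g
        grad_W = ((1.0 - H**2) * a * g[:, None]).T @ X
        a -= lr * grad_a
        W -= lr * grad_W

    moved = np.sqrt(np.sum((W - W0) ** 2) + np.sum((a - a0) ** 2))
    scale = np.sqrt(np.sum(W0**2) + np.sum(a0**2))
    final = alpha * (forward(W, a, X) - f0) - y
    return moved / scale, 0.5 * np.mean(final**2)


def make_toy_data(n=64, d=5, seed=1):
    """Hypothetical toy regression task: the target depends on one coordinate."""
    rng = np.random.default_rng(seed)
    X = rng.normal(size=(n, d))
    y = np.sin(X[:, 0])
    return X, y


if __name__ == "__main__":
    X, y = make_toy_data()
    for alpha in (1.0, 100.0):
        move, loss = train_scaled(alpha, X, y)
        print(f"alpha={alpha:g}  relative movement={move:.4f}  loss={loss:.4f}")
```

Comparing the relative parameter movement for the two alpha values gives a rough check of which regime a run is in. Tracking the residual separately for hypothetical easy and hard subgroups of examples would be the natural next step toward the per-group comparison described above.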
