Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling

Lechao Xiao

TL;DR

This work questions the applicability of classic regularization-centric principles in the scaling era of ML, especially for large language model pretraining. It documents phenomena such as scaling-law crossover, where methods beneficial at small scales fail at larger scales, and demonstrates that maximal stable learning rates and small-batch benefits do not universally translate to the scaling regime. Through empirical studies on transformer architectures, it shows that L2 regularization may not improve LM pretraining, that optimal learning rates can be far from stability limits, and that hyperparameter tuning becomes prohibitively costly at scale. The paper emphasizes the need for new guiding principles for scaling and robust model comparison methods that account for crossovers, arguing that extrapolation and simple hyperparameter transfer are insufficient by themselves. Collectively, these findings push for a principled, scale-aware framework to guide scaling decisions and model comparisons in the era of Internet-scale data and computation.

Abstract

The remarkable success of large language model pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold in the scaling-centric, large language model (LLM) era. These principles include explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed "scaling law crossover," where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not carry over to larger ones. Together, these observations highlight two fundamental questions within this new paradigm:

• Guiding Principles for Scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?
• Model Comparison at Scale: How can models be compared reliably and effectively at a scale where only a single experiment is feasible?
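The crossover idea in the abstract can be made concrete with a small sketch. It assumes, purely for illustration, that each method's loss follows a pure power law in compute, L(C) = a * C^(-b); the function names and all coefficients below are hypothetical and are not taken from the paper.

```python
import math


def power_law_loss(compute, a, b):
    """Loss under an assumed pure power law L(C) = a * C**(-b)."""
    return a * compute ** (-b)


def crossover_compute(a1, b1, a2, b2):
    """Compute budget at which two assumed power-law curves intersect.

    Solving a1 * C**(-b1) = a2 * C**(-b2) gives C = (a1 / a2) ** (1 / (b1 - b2)).
    """
    if b1 == b2:
        raise ValueError("curves with equal exponents never cross")
    return (a1 / a2) ** (1.0 / (b1 - b2))


# Hypothetical coefficients: method A starts lower but improves more slowly.
method_a = (2.0, 0.05)
method_b = (3.0, 0.07)

c_star = crossover_compute(*method_a, *method_b)

small, large = 1e3, 1e12
a_wins_small = power_law_loss(small, *method_a) < power_law_loss(small, *method_b)
b_wins_large = power_law_loss(large, *method_b) < power_law_loss(large, *method_a)
same_at_crossover = math.isclose(
    power_law_loss(c_star, *method_a),
    power_law_loss(c_star, *method_b),
    rel_tol=1e-9,
)
print(f"crossover near C = {c_star:.3g}")
print(a_wins_small, b_wins_large, same_at_crossover)
```

Under these made-up coefficients, a comparison run only at the small budget would rank method A first, while the ranking flips beyond the crossover point. Real scaling curves often include an irreducible-loss term, in which case the intersection generally has to be found numerically.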

Paper Structure (39 sections, 6 equations, 15 figures)

Introduction
Background: Two Paradigms in Machine Learning
The Bias-Variance Trade-off and the U-shape Regime
Over-parameterization and the Second-descent Regime
Large Learning Rate is Better.
Small Batch Size is Better [keskar2016large; smith2017bayesian].
Heavy Under-parameterization and the Skydiving Regime
Architecture and Optimizer
Optimizer.
Is Regularization Needed?
Does L2 Regularization Improve Performance?
Discussion.
Does Maximal Stable Learning Rate Perform Better?
Experiment Setup.
Experimental Results.
...and 24 more sections

Figures (15)

Figure 1: A proposal to reconcile "Classical" Machine Learning (U-shape), ImageNet-scale Deep Learning (Second-descent) and Internet-scale Deep Learning (Skydiving).
Figure 2: Generalization vs Scaling Paradigms.
Figure 3: U-shaped (a) and double-descent (b) curves. Figure from Belkin et al. (2019) [belkin2019reconciling].
Figure 4: Learning Dynamics: Generalization (Image Classification) vs. Scaling (Language Model Pretraining). (a) ResNet-18 on CIFAR-10. Training and test error curves initially overlap, then diverge, forming a generalization gap. Minimizing this gap is the central objective, as the network easily interpolates the training data. (b) Decoder-only transformer on C4. Evaluation curves consistently remain within training curves throughout training, even when the model size and compute are scaled up by factors of 500 and 250,000, respectively.
Figure 5: Training dynamics of four transformer models. From left to right: no L2 and no weight decay, small L2 and no weight decay, no L2 but with weight decay, with both L2 and weight decay.
...and 10 more figures
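
The Figure 5 caption treats "L2" and "weight decay" as separate settings. The captions here do not define the two terms, so the sketch below assumes the common reading: L2 adds a penalty gradient lam * w before the optimizer's update, while weight decay shrinks the weights directly (decoupled, AdamW-style). The toy loss, the helper name adam_first_step, and all constants are hypothetical, and the optimizer is a hand-written single Adam step rather than the paper's training setup.

```python
import math


def adam_first_step(w, grad, lr=0.1, beta1=0.9, beta2=0.999, eps=1e-8,
                    weight_decay=0.0):
    """One Adam step from zero-initialized moments (t = 1).

    `weight_decay` is applied in decoupled form: it shrinks `w` directly and
    never passes through Adam's gradient normalization.
    """
    m = (1 - beta1) * grad          # first moment after one step
    v = (1 - beta2) * grad * grad   # second moment after one step
    m_hat = m / (1 - beta1)         # bias correction at t = 1
    v_hat = v / (1 - beta2)
    adaptive = lr * m_hat / (math.sqrt(v_hat) + eps)
    return w - adaptive - lr * weight_decay * w


w0, lam = 5.0, 0.1
data_grad = w0 - 1.0  # gradient of the toy loss 0.5 * (w - 1)**2 at w0

w_plain = adam_first_step(w0, data_grad)                 # no regularization
w_l2 = adam_first_step(w0, data_grad + lam * w0)         # L2 folded into grad
w_wd = adam_first_step(w0, data_grad, weight_decay=lam)  # decoupled decay

print(w_plain, w_l2, w_wd)
```

On the first step Adam's normalization makes the update size nearly independent of the gradient's magnitude, so the L2 run moves almost exactly as far as the unregularized run, while decoupled decay produces an additional shrink of lr * lam * w. This is one reason the two settings can behave differently under adaptive optimizers, although with plain SGD they coincide up to a rescaling of the coefficient.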
