Rethinking Conventional Wisdom in Machine Learning: From Generalization to Scaling
Lechao Xiao
TL;DR
This work questions the applicability of classic regularization-centric principles in the scaling era of ML, especially for large language model pretraining. It documents phenomena such as scaling-law crossover, where methods beneficial at small scales fail at larger scales, and demonstrates that maximal stable learning rates and small-batch benefits do not universally translate to the scaling regime. Through empirical studies on transformer architectures, it shows that L2 regularization may not improve LM pretraining, that optimal learning rates can be far from stability limits, and that hyperparameter tuning becomes prohibitively costly at scale. The paper emphasizes the need for new guiding principles for scaling and for robust model comparison methods that account for crossovers, arguing that extrapolation and simple hyperparameter transfer are insufficient on their own. Together, these findings argue for a principled, scale-aware framework to guide scaling decisions and model comparisons in the era of Internet-scale data and computation.
Abstract
The remarkable success of large language model pretraining and the discovery of scaling laws signify a paradigm shift in machine learning. Notably, the primary objective has evolved from minimizing generalization error to reducing approximation error, and the most effective strategy has transitioned from regularization (in a broad sense) to scaling up models. This raises a critical question: Do the established principles that proved successful in the generalization-centric era remain valid in this new era of scaling? This paper examines several influential regularization-based principles that may no longer hold in the scaling-centric, large language model (LLM) era. These principles include explicit L2 regularization and implicit regularization through small batch sizes and large learning rates. Additionally, we identify a new phenomenon termed "scaling law crossover," where two scaling curves intersect at a certain scale, implying that methods effective at smaller scales may not remain effective at larger ones. Together, these observations highlight two fundamental questions within this new paradigm:
• Guiding Principles for Scaling: If regularization is no longer the primary guiding principle for model design, what new principles are emerging to guide scaling?
• Model Comparison at Scale: How can models be compared reliably and effectively at a scale where only a single experiment is feasible?
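To make the idea of a scaling law crossover concrete, the following is a minimal sketch, not the paper's method. It assumes each method's loss follows a pure power law in compute, L(C) = a * C^(-b), with no irreducible-loss term; the function names and all coefficients are hypothetical and chosen only for illustration. Under that assumption, two curves with different exponents meet at C* = (a_A / a_B)^(1 / (b_A - b_B)).

```python
def power_law_loss(compute, coeff, exponent):
    """Loss under an assumed pure power law: L(C) = coeff * C^(-exponent)."""
    return coeff * compute ** (-exponent)


def crossover_compute(coeff_a, exp_a, coeff_b, exp_b):
    """Compute budget at which two assumed power-law curves intersect.

    Solves coeff_a * C^(-exp_a) = coeff_b * C^(-exp_b) for C.
    Returns None when the exponents are equal (curves are parallel in
    log-log space, so they never cross unless they are identical).
    """
    if coeff_a <= 0 or coeff_b <= 0:
        raise ValueError("coefficients must be positive")
    if exp_a == exp_b:
        return None
    return (coeff_a / coeff_b) ** (1.0 / (exp_a - exp_b))


# Hypothetical coefficients (not taken from the paper):
# method A starts lower but improves more slowly than method B.
METHOD_A = (2.0, 0.05)  # (coeff, exponent)
METHOD_B = (3.0, 0.07)

c_star = crossover_compute(*METHOD_A, *METHOD_B)

if __name__ == "__main__":
    print(f"crossover at compute ~ {c_star:.3e}")
    for c in (1.0, c_star, 1e12):
        loss_a = power_law_loss(c, *METHOD_A)
        loss_b = power_law_loss(c, *METHOD_B)
        print(f"C={c:.3e}  loss_A={loss_a:.4f}  loss_B={loss_b:.4f}")
```

With these illustrative numbers, method A has the lower loss below the crossover and method B has the lower loss above it, so a comparison run only at small compute would pick the method that loses at large compute.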
