Weight Decay Improves Language Model Plasticity
Tessa Han, Sebastian Bordt, Hanlin Zhang, Sham Kakade
TL;DR
This work investigates how weight decay during pretraining influences language model plasticity, that is, the ability to adapt to downstream tasks. By systematically varying the weight decay parameter across Llama-2 and OLMo-2 models and evaluating downstream performance after fine-tuning, the authors show that larger weight decay often enhances downstream adaptability even when pretraining loss worsens, so pretraining loss alone is not a reliable predictor of downstream success. They identify mechanistic effects of weight decay, including more linearly separable representations, reduced attention-matrix rank, and diminished pretraining overfitting, which together help explain the improved plasticity. The study argues for incorporating downstream objectives into hyperparameter optimization and provides a nuanced view of weight decay’s multifaceted role in shaping model behavior across training stages.
Abstract
The prevailing paradigm in large language model (LLM) development is to pretrain a base model, then perform further training to improve performance and model behavior. However, hyperparameter optimization and scaling laws have been studied primarily from the perspective of the base model's validation loss, ignoring downstream adaptability. In this work, we study pretraining from the perspective of model plasticity, that is, the ability of the base model to successfully adapt to downstream tasks through fine-tuning. We focus on the role of weight decay, a key regularization parameter during pretraining. Through systematic experiments, we show that models trained with larger weight decay values are more plastic, meaning they show larger performance gains when fine-tuned on downstream tasks. This phenomenon can lead to counterintuitive trade-offs where base models that perform worse after pretraining can perform better after fine-tuning. Further investigation of weight decay's mechanistic effects on model behavior reveals that it encourages linearly separable representations, regularizes attention matrices, and reduces overfitting on the training data. In conclusion, this work demonstrates the importance of using evaluation metrics beyond cross-entropy loss for hyperparameter optimization and sheds light on the multifaceted role that a single optimization hyperparameter plays in shaping model behavior.
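For readers unfamiliar with the hyperparameter in question, the following is a minimal sketch of how weight decay enters a parameter update. It assumes a plain SGD step with decoupled weight decay; the abstract does not state which optimizer or decay formulation the experiments use, and the function names, learning rate, and decay values here are purely illustrative.

```python
def sgd_step_with_weight_decay(weights, grads, lr, weight_decay):
    """One SGD step with decoupled weight decay (illustrative assumption).

    Each weight is shrunk toward zero by lr * weight_decay * w and also
    moved along the negative gradient by lr * g.
    """
    return [w - lr * weight_decay * w - lr * g for w, g in zip(weights, grads)]


def l2_norm(values):
    """Euclidean norm of a list of floats."""
    return sum(v * v for v in values) ** 0.5


# Hypothetical toy setup: zero gradients isolate the shrinkage effect.
weights = [1.0, -2.0, 0.5]
zero_grads = [0.0, 0.0, 0.0]

no_decay = weights
with_decay = weights
for _ in range(100):
    no_decay = sgd_step_with_weight_decay(
        no_decay, zero_grads, lr=0.1, weight_decay=0.0
    )
    with_decay = sgd_step_with_weight_decay(
        with_decay, zero_grads, lr=0.1, weight_decay=0.1
    )

print("norm without decay:", l2_norm(no_decay))
print("norm with decay:   ", l2_norm(with_decay))
```

With zero gradients, each step multiplies every weight by `1 - lr * weight_decay`, so the decayed weights shrink geometrically while the undecayed ones stay fixed. In real training the gradient term counteracts this shrinkage, and the balance between the two is what the weight decay value controls.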
