Overtrained Language Models Are Harder to Fine-Tune

Jacob Mitchell Springer, Sachin Goyal, Kaiyue Wen, Tanishq Kumar, Xiang Yue, Sadhika Malladi, Graham Neubig, Aditi Raghunathan

TL;DR

This work challenges the assumption that more pre-training data always improves downstream performance by identifying catastrophic overtraining, where extended pre-training harms post-training results. It combines two lines of evidence: extensive empirical studies showing real-world degradation after instruction and multimodal fine-tuning, and controlled linear-model analyses that reveal progressively increasing sensitivity to parameter updates. The authors formalize the phenomenon in a two-layer linear transfer-learning framework, proving the existence of inflection points and that degradation is inevitable without regularization, while also showing that learning-rate strategies and regularization can delay but not always eliminate the effect. The findings call for a reevaluation of pre-training strategies and highlight the need to balance base-model gains against downstream adaptability in practical deployments.

Abstract

Large language models are pre-trained on ever-growing token budgets under the assumption that better pre-training performance translates to improved downstream models. In this work, we challenge this assumption and show that extended pre-training can make models harder to fine-tune, leading to degraded final performance. We term this phenomenon catastrophic overtraining. For example, the instruction-tuned OLMo-1B model pre-trained on 3T tokens leads to over 2% worse performance on multiple standard LLM benchmarks than its 2.3T token counterpart. Through controlled experiments and theoretical analysis, we show that catastrophic overtraining arises from a systematic increase in the broad sensitivity of pre-trained parameters to modifications, including but not limited to fine-tuning. Our findings call for a critical reassessment of pre-training design that considers the downstream adaptability of the model.

Paper Structure

This paper contains 57 sections, 29 theorems, 98 equations, 65 figures, 6 tables.

Key Result

Theorem 4.1

There exists a sequence of timesteps $t_1 < \cdots < t_i < \cdots < t_d$ such that at timestep $t_i$,

Figures (65)

  • Figure 1: Language models with extensive pre-training can exhibit catastrophic overtraining, where the performance of post-trained models degrades as the pre-training stage is extended. We report the average performance of five common LLM benchmarks (ARC-Easy, ARC-Challenge, PIQA, HellaSwag) for OLMo-1B intermediate checkpoints before and after instruction fine-tuning, with additional results in Section \ref{sec:experiments}. We argue that catastrophic overtraining arises from a progressive increase, throughout pre-training, in model sensitivity to parameter transformations, which leads to greater forgetting of the capabilities acquired during pre-training once the model is fine-tuned (Section \ref{sec:experiments_controlled}). Overall, our results challenge the notion that scaling pre-training is strictly beneficial.
  • Figure 2: Extending pre-training can degrade performance after fine-tuning on Anthropic-HH (left) and LLaVA (right). We consider fine-tuning on various intermediate checkpoints from OLMo-1B pre-training. While the base model performance (before fine-tuning) improves with the pre-training token budget (black dashed curve), the performance after fine-tuning drops as we pre-train on more tokens. In the instruction-tuning setting (left), we observe degradation on the ID task (AlpacaEval, green) as well as on the OOD benchmarks (ARC, PIQA, and HellaSwag, blue). In the multimodal tuning setting (right), we observe degradation with overtraining on PIQA, and a larger gap between the fine-tuned and base model for ARC, HellaSwag, and Winogrande. We report the average over three independent fine-tuning runs, with error bars. Refer to Appendix \ref{app:ift_omitted_figures} for additional models (OLMo-2-7B, LLM360-Amber) and instruction-tuning datasets (extended results for Anthropic-HH, TULU).
  • Figure 3: Progressive sensitivity of Gaussian perturbations (left): extending pre-training progressively increases the degree to which a Gaussian parameter perturbation degrades perplexity. Catastrophic overtraining (right): eventually, this leads to overall worse pre-training perplexity. We perturb OLMo-30M models trained on various pre-training token budgets with Gaussian noise scaled by the factor $\gamma$ (color). The left plot shows the difference in perplexity between the perturbed and unperturbed models, while the right plot shows the absolute perplexity of the perturbed models.
  • Figure 4: Progressive sensitivity of fine-tuning: Extending pre-training progressively increases the degree to which fine-tuning degrades perplexity. OLMo-30M models trained on various pre-training token budgets are fine-tuned on downstream tasks using fixed hyperparameters: math (GSM8k), code (Starcoder-Python), and QA (SIQA). Lines connect models sharing hyperparameters, differing only in pre-training tokens. Learning rates range from 4e-06 to the dataset-specific maximum ($\eta_{\mathrm{max}}$). We report the difference in perplexity between the fine-tuned and pre-trained models, as a function of the number of pre-training tokens.
  • Figure 5: Catastrophic overtraining for fine-tuning with fixed hyperparameters: extending pre-training can lead to an overall increase in the C4 perplexity (top), and ID perplexity (fine-tuning task; bottom), when fine-tuning with fixed hyperparameters. OLMo-30M models pre-trained with varying token budgets are fine-tuned on downstream tasks using fixed hyperparameters: math (GSM8k), code (Starcoder-Python), QA (SIQA), and classification (MR, RTE, TREC). Lines connect models sharing hyperparameters, differing only in pre-training tokens. Learning rates range from 4e-06 to the dataset-specific maximum ($\eta_{\mathrm{max}}$). At sufficiently large learning rates (lighter colors), we observe performance degradation in both ID and pre-training metrics beyond certain pre-training budgets. (See Appendices \ref{app:controlled_experimental_details} and \ref{app:controlled_omitted_figures} for ablations.)
  • ...and 60 more figures
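
The perturbation measurement described in the Figure 3 caption (add Gaussian noise scaled by $\gamma$ to the parameters, then compare the perturbed and unperturbed metric) can be sketched roughly as follows. This is only an illustrative toy, not the authors' code: it assumes i.i.d. noise with standard deviation `gamma` on every parameter, uses mean squared error on a small two-layer linear model in place of language-model perplexity, and the names `perturb`, `sensitivity`, and `mse` are hypothetical.

```python
import numpy as np


def perturb(params, gamma, rng):
    """Return a copy of params with i.i.d. N(0, gamma^2) noise added to every entry.

    Assumption: isotropic noise with standard deviation gamma; the paper's
    exact scaling of the noise may differ.
    """
    return {name: w + gamma * rng.standard_normal(w.shape) for name, w in params.items()}


def sensitivity(loss_fn, params, gamma, n_samples=32, seed=0):
    """Average increase in loss caused by Gaussian perturbations of scale gamma."""
    rng = np.random.default_rng(seed)
    base = loss_fn(params)
    perturbed = [loss_fn(perturb(params, gamma, rng)) for _ in range(n_samples)]
    return float(np.mean(perturbed) - base)


# Toy stand-in for a pre-trained model: a two-layer linear map x -> x @ W1 @ W2
# placed exactly at an optimum of the regression objective below.
data_rng = np.random.default_rng(0)
X = data_rng.standard_normal((256, 8))
W_true = data_rng.standard_normal((8, 4))
Y = X @ W_true


def mse(params):
    """Mean squared error of the two-layer linear model (stand-in for perplexity)."""
    pred = X @ params["W1"] @ params["W2"]
    return float(np.mean((pred - Y) ** 2))


params = {"W1": W_true.copy(), "W2": np.eye(4)}

if __name__ == "__main__":
    for gamma in (0.0, 0.01, 0.1):
        print(f"gamma={gamma}: loss increase = {sensitivity(mse, params, gamma):.6f}")
```

To mirror the left panel of Figure 3, one would evaluate `sensitivity` at a fixed `gamma` on checkpoints saved at increasing pre-training token budgets and plot the resulting loss increase against the budget; the right panel corresponds to plotting the absolute perturbed loss instead of the difference.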

Theorems & Definitions (55)

  • Theorem 4.1: Informal statement of Andrew1810, gidel2019implicitregularizationdiscretegradient
  • Definition 4.2: Inflection point
  • Proposition 4.3: Informal version of \ref{lem:gaussian_perturbations}
  • Theorem 4.4: Informal version of \ref{app_thm:gaussian}
  • Definition 4.5
  • Theorem 4.6: Progressive sensitivity; informal version of \ref{app_thm:prog_sensitivity}
  • Proof: Proof sketch
  • Theorem 4.7: Catastrophic overtraining; informal version of Theorem \ref{app_thm:sensitivity_overtraining}
  • Proof: Proof sketch
  • Theorem 1.1: Theorem 1 of gidel2019implicitregularizationdiscretegradient
  • ...and 45 more