Instability in Downstream Task Performance During LLM Pretraining

Yuto Nishida, Masaru Isonuma, Yusuke Oda

TL;DR

Downstream task performance fluctuates during LLM pretraining, which hinders reliable evaluation and model selection. The paper analyzes this instability across model sizes and task categories, introducing mean total variation (MTV) and an instability score (IS) to quantify the fluctuations. The authors propose two post-hoc checkpoint integration methods, checkpoint averaging and checkpoint ensemble, as training-free mitigations, supported by a theoretical justification that averaging reduces variability. Empirical results show that these methods reduce task-level and example-level instability and often improve mean performance. Checkpoint averaging emerges as a cost-effective stabilizer, while ensembling stabilizes more strongly at the cost of inference overhead. The work offers a practical way to make evaluation during LLM pretraining more robust and suggests uncertainty-aware or adaptive stabilization strategies as directions for future research.
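
To give a feel for what a fluctuation measure like MTV captures, here is a minimal sketch. The summary above does not reproduce the paper's exact definition, so this assumes a common reading of total variation: the mean absolute change in a task score between consecutive checkpoints. The function name and the example score sequences are hypothetical and not taken from the paper.

```python
def mean_total_variation(scores):
    """Mean absolute change between consecutive checkpoint scores.

    ASSUMPTION: this is one plausible reading of "mean total variation";
    the paper's own definition may normalize or aggregate differently.
    """
    if len(scores) < 2:
        raise ValueError("need at least two checkpoint scores")
    # Absolute difference between each checkpoint and the next one.
    diffs = [abs(later - earlier) for earlier, later in zip(scores, scores[1:])]
    return sum(diffs) / len(diffs)


# Hypothetical accuracy curves with the same start point and similar range.
noisy_curve = [0.50, 0.56, 0.52, 0.58, 0.54]   # zig-zags between checkpoints
smooth_curve = [0.50, 0.52, 0.54, 0.56, 0.58]  # improves steadily

print(mean_total_variation(noisy_curve))
print(mean_total_variation(smooth_curve))
```

Under this assumed definition, the zig-zagging curve scores higher than the steadily improving one, even though both cover a similar range of accuracy values.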

Abstract

When training large language models (LLMs), it is common practice to track downstream task performance throughout the training process and select the checkpoint with the highest validation score. However, downstream metrics often exhibit substantial fluctuations, making it difficult to identify the checkpoint that truly represents the best-performing model. In this study, we empirically analyze the stability of downstream task performance in an LLM trained on diverse web-scale corpora. We find that task scores frequently fluctuate throughout training, both at the aggregate and example levels. To address this instability, we investigate two post-hoc checkpoint integration methods: checkpoint averaging and ensemble, motivated by the hypothesis that aggregating neighboring checkpoints can reduce performance volatility. We demonstrate both empirically and theoretically that these methods improve downstream performance stability without requiring any changes to the training procedure.
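
The difference between the two integration methods can be illustrated with a small sketch. It assumes that checkpoint averaging means an element-wise mean over the parameters of neighboring checkpoints, and that ensembling means averaging the output probabilities of those checkpoints at inference time. The abstract does not specify weighting or window size, so uniform weights are an assumption here, and all names and toy values are hypothetical.

```python
def average_checkpoints(checkpoints):
    """Element-wise mean of parameters across checkpoints.

    Each checkpoint is a dict mapping a parameter name to a flat list of
    floats. ASSUMPTION: uniform weights over the given checkpoints.
    The result is a single model, so inference cost stays unchanged.
    """
    n = len(checkpoints)
    averaged = {}
    for name in checkpoints[0]:
        per_checkpoint = [ckpt[name] for ckpt in checkpoints]
        averaged[name] = [sum(values) / n for values in zip(*per_checkpoint)]
    return averaged


def ensemble_probabilities(prob_lists):
    """Mean of per-checkpoint output distributions for one example.

    Every checkpoint must be run at inference time, which is the source
    of the overhead compared with parameter averaging.
    """
    n = len(prob_lists)
    return [sum(probs) / n for probs in zip(*prob_lists)]


# Toy parameters from two neighboring checkpoints (hypothetical values).
ckpt_a = {"w": [1.0, 2.0], "b": [0.0]}
ckpt_b = {"w": [3.0, 4.0], "b": [1.0]}
print(average_checkpoints([ckpt_a, ckpt_b]))

# Toy answer distributions from the same two checkpoints on one example.
print(ensemble_probabilities([[0.5, 0.5], [1.0, 0.0]]))
```

Averaging happens once in parameter space and yields one model to evaluate, whereas the ensemble combines predictions and needs a forward pass per checkpoint.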

Paper Structure

This paper contains 45 sections, 7 equations, 26 figures, 3 tables.

Figures (26)

  • Figure 1: Overview of stabilizing downstream task performance during pretraining. We observe instability in task performance during LLM pretraining, and experimentally show that theoretically motivated checkpoint integration methods improve evaluation stability.
  • Figure 2: EL (Entity Linking)
  • Figure 3: FA (Fundamental Analysis)
  • Figure 4: HE (Human Examination)
  • Figure 5: MC (Multiple Choice QA)
  • ...and 21 more figures