MaP: A Unified Framework for Reliable Evaluation of Pre-training Dynamics

Jiapeng Wang, Changxin Tian, Kunlong Chen, Ziqi Liu, Jiaxin Mao, Wayne Xin Zhao, Zhiqiang Zhang, Jun Zhou

TL;DR

MaP addresses the instability in pre-training evaluation of LLMs by decoupling parameter instability from evaluation instability. It introduces a unified framework that merges recent checkpoints to stabilize parameters and employs Pass@k to stabilize measurements, yielding smoother progress curves and more consistent model rankings. Across extensive experiments, MaP demonstrates a synergistic improvement over either component alone, providing a more faithful view of training dynamics and a robust empirical foundation for LLM research. The approach enables more reliable ablations and downstream performance predictions, with clear guidance on hyperparameters and cost trade-offs for practical usage.

Abstract

Reliable evaluation is fundamental to the progress of Large Language Models (LLMs), yet the evaluation process during pre-training is plagued by significant instability that obscures true learning dynamics. In this work, we systematically diagnose this instability, attributing it to two distinct sources: Parameter Instability from training stochasticity and Evaluation Instability from noisy measurement protocols. To counteract both sources of noise, we introduce MaP, a dual-pronged framework that synergistically integrates checkpoint Merging and the Pass@k metric. Checkpoint merging smooths the parameter space by averaging recent model weights, while Pass@k provides a robust, low-variance statistical estimate of model capability. Extensive experiments show that MaP yields significantly smoother performance curves, reduces inter-run variance, and ensures more consistent model rankings. Ultimately, MaP provides a more reliable and faithful lens for observing LLM training dynamics, laying a crucial empirical foundation for LLM research.
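
The two components named in the abstract can be illustrated with a minimal sketch. It rests on assumptions that this summary does not confirm: merging is shown as a uniform average over the most recent checkpoints, with each checkpoint represented as a plain dictionary of float lists, and Pass@k is computed with the widely used unbiased estimator 1 - C(n-c, k)/C(n, k), where n is the number of samples and c the number of correct ones. The function names `merge_checkpoints` and `pass_at_k` are hypothetical and not taken from the paper.

```python
from math import comb


def merge_checkpoints(checkpoints):
    """Uniformly average parameters across checkpoints.

    Assumption: each checkpoint is a dict mapping a parameter name to a
    flat list of floats, and all checkpoints share the same names and
    shapes. The paper's actual weighting scheme may differ.
    """
    if not checkpoints:
        raise ValueError("need at least one checkpoint")
    count = len(checkpoints)
    merged = {}
    for name in checkpoints[0]:
        # Group the i-th value of this parameter across all checkpoints.
        columns = zip(*(ckpt[name] for ckpt in checkpoints))
        merged[name] = [sum(col) / count for col in columns]
    return merged


def pass_at_k(n, c, k):
    """Unbiased Pass@k estimate for one problem.

    n: samples generated, c: samples judged correct, k: budget (k <= n).
    Assumption: this is the standard estimator; the summary does not
    state which estimator the authors use.
    """
    if not 0 <= c <= n or not 1 <= k <= n:
        raise ValueError("require 0 <= c <= n and 1 <= k <= n")
    # comb(n - c, k) is 0 when fewer than k incorrect samples exist.
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With n = 16 samples per problem, as in the Figure 5 setup, a call such as `pass_at_k(16, c, k)` can be averaged over all benchmark problems for each k of interest. Evaluating the merged weights with this metric combines both stabilizers.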

Paper Structure

This paper contains 33 sections, 17 equations, 5 figures, 6 tables.

Figures (5)

  • Figure 1: Illustrations of evaluation instability during pre-training. (a) When comparing training strategies, performance curves often intersect, obscuring which strategy is truly superior. (b) The performance of a single model can be highly volatile during pre-training, which may conceal underlying issues with the training process. (c) A rank correlation analysis shows a severe mismatch between the rankings of pre-trained models and their fine-tuned counterparts, indicating that pre-training evaluation often fails to reliably predict final downstream performance.
  • Figure 2: Visual comparison of performance trajectories under different stability protocols.
  • Figure 3: Evaluation stability visualization across different benchmarks.
  • Figure 4: Checkpoint merging smooths training trajectories and clarifies model capabilities.
  • Figure 5: Pass@k improves the consistency between pre-training and post-SFT model rankings. (a) With greedy evaluation, the pre-training rank is a poor predictor of post-SFT rank, yielding a Pairwise Ranking Reversal Rate (PRR) of 50%. (b) Using Pass@16 drastically improves consistency, reducing the PRR to 22.73%. We generate $n=16$ samples per problem and calculate the metric for $k \in \{1, 2, 4, 8, 16\}$. (c) The reversal proportion decreases monotonically as the sample count $k$ increases.
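
The Pairwise Ranking Reversal Rate in Figure 5 is not defined in this summary. One plausible reading, sketched below, is the fraction of model pairs whose relative order under the pre-training score is flipped after SFT. The function name `pairwise_reversal_rate` and the tie handling (tied pairs count as not reversed but stay in the denominator) are assumptions for illustration only.

```python
from itertools import combinations


def pairwise_reversal_rate(pre_scores, post_scores):
    """Fraction of model pairs ranked in opposite order before and after SFT.

    pre_scores[i] and post_scores[i] are the scores of model i.
    Assumption: a pair is reversed only when the two score differences
    have strictly opposite signs; ties are treated as not reversed.
    """
    if len(pre_scores) != len(post_scores):
        raise ValueError("score lists must have the same length")
    pairs = list(combinations(range(len(pre_scores)), 2))
    if not pairs:
        raise ValueError("need at least two models")
    reversed_pairs = sum(
        1
        for i, j in pairs
        if (pre_scores[i] - pre_scores[j]) * (post_scores[i] - post_scores[j]) < 0
    )
    return reversed_pairs / len(pairs)
```

Under this reading, a rate near 0.5 means the pre-training ranking carries little more information about the post-SFT ranking than a coin flip, which matches the caption's description of greedy evaluation as a poor predictor.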