Predicting Training Re-evaluation Curves Enables Effective Data Curriculums for LLMs

Shane Bergsma, Nolan Dey, Joel Hestness

TL;DR

This work introduces the training re-evaluation curve (TREC), a diagnostic that measures the loss of the fully trained model on each training batch as a function of the time $t$ at which that batch appeared, denoted $\mathcal{L}_{\mathrm{re}}(t)$. It shows that placing high-quality (HQ) data at the TREC valley yields the best downstream performance, and that TRECs can be predicted in advance from AdamW's EMA timescale, enabling proactive data curriculums. The authors provide empirical evidence across models from 111M to 3.9B parameters, connect TRECs to AdamW's EMA dynamics, and formalize a framework that forecasts TRECs under time-varying learning rates with a training-fraction adjustment. They demonstrate practical utility in sparse MoEs, in evaluating published LLM recipes, and in continual pre-training (CPT), while noting cross-schedule limitations and the need for schedule-aware predictions. Overall, TREC-guided data placement offers a principled alternative to heuristic late-stage HQ insertion, with implications for data selection, curriculum design, and CPT strategies in large-scale language model training.
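
To make the definition of $\mathcal{L}_{\mathrm{re}}(t)$ concrete, the following is a minimal sketch on a toy linear-regression problem, not the paper's LLM setup. The function name `toy_trec`, the synthetic data, and all hyperparameters are illustrative assumptions. The optimizer is plain SGD with decoupled weight decay, standing in for AdamW. The sketch only shows the mechanics: train once over a fixed sequence of batches, then re-evaluate every batch with the final weights.

```python
import numpy as np

def toy_trec(num_steps=200, batch_size=8, dim=16, lr=0.05, weight_decay=0.1, seed=0):
    """Return (train_loss, re_loss) for a toy regression run.

    train_loss[t]: loss on batch t at the moment it was seen in training.
    re_loss[t]:    loss on the same batch t under the FINAL weights (the TREC).
    Everything here is a hypothetical toy setup, not the paper's experiments.
    """
    rng = np.random.default_rng(seed)
    w_true = rng.normal(size=dim)

    # A fixed sequence of training batches, indexed by the step at which each is used.
    xs = rng.normal(size=(num_steps, batch_size, dim))
    ys = xs @ w_true + 0.5 * rng.normal(size=(num_steps, batch_size))

    w = np.zeros(dim)
    train_loss = np.empty(num_steps)
    for t in range(num_steps):
        err = xs[t] @ w - ys[t]
        train_loss[t] = np.mean(err ** 2)
        grad = 2.0 * xs[t].T @ err / batch_size
        # SGD step with decoupled weight decay (assumed stand-in for AdamW).
        w = (1.0 - lr * weight_decay) * w - lr * grad

    # Re-evaluate every training batch using the final weights.
    re_loss = np.mean((xs @ w - ys) ** 2, axis=1)
    return train_loss, re_loss
```

Plotting `re_loss` against the step index gives the toy analogue of a TREC, while `train_loss` is the ordinary training curve. The two generally differ because `re_loss` reflects how much each batch still influences the final weights.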

Abstract

Data curriculums have become central to successful LLM training, yet principles governing optimal data placement remain unclear. We introduce the *training re-evaluation curve (TREC)*, a diagnostic that retrospectively evaluates training batches *using the final model weights*. The TREC characterizes how well a trained model retains training data as a function of *when* the data was encountered during training. Analyzing TRECs for models from 111M to 3.9B parameters, we show that placing high-quality data at low points on the TREC significantly improves performance. Importantly, while a TREC is initially observable only after training, we demonstrate it can be *predicted in advance* from AdamW's implicit EMA coefficients, enabling proactive curriculum design. By predicting TRECs for published training recipes, we explain prior ablations and reveal suboptimal data placements. We also align high-quality data with TREC minima in order to improve continual pre-training of a 3.9B-parameter LLM trained on 900B tokens.
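
The "implicit EMA coefficients" mentioned above can be illustrated with a short sketch. It assumes the standard decoupled-weight-decay update $\theta_t = (1 - \eta_t \lambda)\,\theta_{t-1} - \eta_t u_t$, which makes the final weights an exponential moving average with per-step smoothing $\alpha_t = \eta_t \lambda$, so the update from step $t$ contributes with weight $c_t = \alpha_t \prod_{s>t} (1 - \alpha_s)$. The function name `ema_coefficients` and the example schedules are hypothetical. The sketch computes only these coefficients and does not implement the paper's full TREC predictor, which also involves a training-fraction adjustment.

```python
import numpy as np

def ema_coefficients(lrs, weight_decay):
    """Weight of each step's update in the final parameters.

    Assumes theta_t = (1 - lr_t * wd) * theta_{t-1} - lr_t * u_t, i.e. an EMA
    with smoothing alpha_t = lr_t * wd. Returns c_t = alpha_t * prod_{s>t}(1 - alpha_s).
    """
    alphas = np.asarray(lrs, dtype=float) * weight_decay
    keep = 1.0 - alphas
    # suffix[t] = product of keep[s] over all later steps s > t.
    suffix = np.ones_like(keep)
    suffix[:-1] = np.cumprod(keep[::-1])[::-1][1:]
    return alphas * suffix

# Illustrative schedules (assumed values, not taken from the paper).
num_steps, peak_lr, wd = 1000, 0.01, 0.1
constant = np.full(num_steps, peak_lr)
linear_decay = peak_lr * (1.0 - np.arange(num_steps) / num_steps)

c_const = ema_coefficients(constant, wd)
c_decay = ema_coefficients(linear_decay, wd)
```

With a constant learning rate the coefficients grow monotonically toward the last step. With linear decay the small late learning rates shrink the late coefficients, so the largest coefficient falls before the end of training.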

Paper Structure

This paper contains 86 sections, 34 equations, 27 figures, 7 tables.

Figures (27)

  • Figure 1: *Left* (610M params, learning rate drop at 70%): While train loss steadily falls, the optimal high-quality (HQ) data placement is in the TREC valley, not at the end. *Middle* (610M, linear LR decay): TREC shape varies with AdamW timescale $\tau$ (varied via weight decay $\lambda$). *Right* (size varies, linear LR decay, 20 TPP): TRECs align across 1000$\times$ scaling of training compute when $\tau$ matches.
  • Figure 1: SlimPajama mixes used in the placement tests (\ref{fig:placements}): the general blend is the original distribution; the code blend is the HQ data.
  • Figure 2: TRECs predict best placement. *Left*: Example placement curriculum 5-of-10 for Step decay. *Right*: Results for $10\times$.
  • Figure 3: Timescale $\tau$ determines TREC shape (610M, 80 TPP). Sweeping $\eta$ (*left*), $\lambda$ (*middle*), or $B$ (*right*) produces matching variations in TRECs when $\tau$ (\ref{eq:tema}) varies identically.
  • Figure 4: Timescale $\tau$ and TREC shape across model/dataset scales. *Left*: Similar $\tau$ yields similar TREC shapes across model scales when training to 20 TPP ($\tau \approx 0.3$). At 111M (*middle*, $\tau = 0.021$) and 610M (*right*, $\tau = 0.105$), increasing TPP shifts TRECs slightly right.
  • ...and 22 more figures

Theorems & Definitions (2)

  • Definition 1: TREC
  • Definition 2: High-Quality Data