Improving Pretraining Data Using Perplexity Correlations

Tristan Thrush, Christopher Potts, Tatsunori Hashimoto

TL;DR

This work tackles the challenge of selecting high-quality pretraining data without running expensive pretraining experiments. It introduces a correlation-based framework that uses perplexity signals from publicly available LLMs to rank data domains and construct a nonnegative sampling distribution, grounded in a high-dimensional single-index model. The authors develop a robust, rank-based estimator that links per-domain losses to downstream benchmark performance, and show that a convex projection of this estimate yields data mixes that improve downstream results. The method outperforms DSIR and matches the best hand-engineered data selector found in DataComp-LM, with gains that grow with model size. The approach requires no LLM training for selection and is released as a documented pip package, so practitioners can use public-model perplexities for data curation at scale.

Abstract

Quality pretraining data is often seen as the key to high-performance language models. However, progress in understanding pretraining data has been slow due to the costly pretraining runs required for data selection experiments. We present a framework that avoids these costs and selects high-quality pretraining data without any LLM training of our own. Our work is based on a simple observation: LLM losses on many pretraining texts are correlated with downstream benchmark performance, and selecting high-correlation documents is an effective pretraining data selection method. We build a new statistical framework for data selection centered around estimates of perplexity-benchmark correlations and perform data selection using a sample of 90 LLMs taken from the Open LLM Leaderboard on texts from tens of thousands of web domains. In controlled pretraining experiments at the 160M parameter scale on 8 benchmarks, our approach outperforms DSIR on every benchmark, while matching the best data selector found in DataComp-LM, a hand-engineered bigram classifier. We have now also updated this paper to include results from preregistered experiments with new pretraining data on an aggregation of 22 benchmarks up to the 1.4B scale, showing increasing improvements of our method over others with more scale. A pip package with full documentation can be found here: https://github.com/TristanThrush/perplexity-correlations.
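The core observation in the abstract can be illustrated with a small sketch. The code below is not the paper's estimator or the pip package's API; it assumes a plain Spearman-style rank correlation as a stand-in, and the function names and array layout (`losses` with one row per public LLM and one column per domain, `scores` with one benchmark score per LLM) are hypothetical.

```python
import numpy as np


def rank(values: np.ndarray) -> np.ndarray:
    """0-based ranks along axis 0 (ties broken by position; fine for a sketch)."""
    return np.argsort(np.argsort(values, axis=0), axis=0).astype(float)


def loss_benchmark_rank_correlation(losses: np.ndarray, scores: np.ndarray) -> np.ndarray:
    """Spearman-style correlation between negative loss and benchmark score.

    losses: shape (n_models, n_domains), per-domain loss of each public LLM.
    scores: shape (n_models,), benchmark score of each LLM (higher is better).
    Returns shape (n_domains,); values near +1 mark domains where lower loss
    tends to go with higher benchmark performance.
    """
    loss_ranks = rank(-losses)          # negate so that low loss gets a high rank
    score_ranks = rank(scores)
    loss_centered = loss_ranks - loss_ranks.mean(axis=0)
    score_centered = score_ranks - score_ranks.mean()
    cov = loss_centered.T @ score_centered
    denom = np.sqrt((loss_centered ** 2).sum(axis=0) * (score_centered ** 2).sum())
    return cov / denom


# Toy data (made up): 3 models, 2 domains.
toy_losses = np.array([[1.0, 3.0],
                       [2.0, 2.0],
                       [3.0, 1.0]])
toy_scores = np.array([0.9, 0.5, 0.1])
toy_corr = loss_benchmark_rank_correlation(toy_losses, toy_scores)
```

In this toy example the first domain receives correlation +1 (models with lower loss on it score higher) and the second receives -1, so only the first would be a candidate for selection.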

Paper Structure (46 sections, 8 theorems, 106 equations, 11 figures, 3 tables, 1 algorithm)

Key Result

Proposition 1

Suppose that $\bm{\theta}^*$ weights are non-negative. Then, for models with associated likelihoods $\mathbf{x} \in \mathcal{X}\subset \mathbb{R}^{D}$, the minimizer of the pretraining loss over the $\bm{\theta}^*$ sampling distribution $\mathbb{E}_{j\sim \bm{\theta}^*}[x_j]$ al

Figures (11)

  • Figure 1: We pretrain on domains where lower loss is generally correlated with higher downstream performance. Our approach does this by taking public, pretrained LLMs and measuring correlations across their log-likelihoods (left, red matrix) and performance on a target benchmark (center, blue vector). We then perform data selection by training a fastText classifier that distinguishes high correlation domains from others. This approach is on par with the best-known data selection methods in our experiments, despite requiring no human selection of high-quality domains.
  • Figure 2: Pretraining results with different data selection methods. Each row is an LLM, and each column is a task. The number in the upper left indicates the ranking of the method when targeting that benchmark compared to other methods (lower is better). Numbers within the heatmap denote accuracy for all benchmarks except the LAMBADA tasks for which the values are log perplexities (where lower scores are better). We find that our approach appropriately optimizes data mixes for the target language and benchmark, and matches the fastText baseline across most benchmarks.
  • Figure 3: Preregistered experiment results. We did not see a benefit from using perplexity correlations when the dataset is already extensively filtered, but saw large consistent benefits otherwise, with the benefits increasing with scale. For the pre-filtered pool, the largest correlation coefficient was $.33$ and the smallest was $.23$ with the vast majority of domains being over $.29$, so we could have predicted no or small gains before pretraining. In the raw pool for DCLM Core, the largest coefficient was $.32$ and the smallest was $-.07$. Pre-filtered pool results for Non-EN LAMBADA are not shown because there is only English in the pre-filtered pool. See the appendix on preregistered experiments for more details.
  • Figure 4: Rank predictions given by $\langle\hat{\bm{\theta}}^{\textnormal{proj}},\Phi(\mathbf{x})\rangle$ for PIQA and LAMBADA FR. A standard deviation ($\sigma$) from the ideal fit is shown in red. $2\sigma$ is shown in orange. Many models outside $2\sigma$ (shown in blue) are trained on atypical data such as multilingual data, code, or GPT-4 outputs. Models with atypical architectures (i.e. Mamba) are shown in black. Generally, our estimate tightly predicts ordinal benchmark performance from web corpus losses.
  • Figure 5: Pretraining results for different methods within our paradigm. Overall, we see that many rank-correlation pretraining data selection approaches perform well.
  • ...and 6 more figures
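The captions above describe selecting high-correlation domains under a fixed training budget. One simple way to turn per-domain correlation estimates into a nonnegative sampling distribution is to fill the token budget greedily, starting from the most correlated domain and never taking more tokens than a domain contains. The sketch below illustrates that idea under stated assumptions; the function name, arguments, and toy numbers are hypothetical, and it should not be read as the paper's exact projection or as the pip package's interface.

```python
import numpy as np


def greedy_domain_mix(correlations, domain_tokens, token_budget):
    """Greedy budgeted selection of domains by correlation (illustrative only).

    correlations: per-domain correlation estimates (higher is better).
    domain_tokens: number of tokens available in each domain.
    token_budget: total number of tokens to select.
    Returns a nonnegative vector over domains that sums to 1.
    """
    correlations = np.asarray(correlations, dtype=float)
    domain_tokens = np.asarray(domain_tokens, dtype=float)
    order = np.argsort(-correlations)   # most correlated domains first
    taken = np.zeros_like(domain_tokens)
    remaining = float(token_budget)
    for j in order:
        if remaining <= 0:
            break
        take = min(domain_tokens[j], remaining)  # cannot exceed the domain's size
        taken[j] = take
        remaining -= take
    total = taken.sum()
    if total == 0:
        raise ValueError("no tokens selected; check token_budget and domain_tokens")
    return taken / total


# Toy example (made-up numbers): three domains and a budget of 100 tokens.
toy_mix = greedy_domain_mix([0.1, 0.5, 0.3], [100, 50, 100], 100)
```

Here the budget is filled entirely by the two most correlated domains, so the least correlated domain gets weight zero. Domains with nonzero weight could then serve as positive examples for a classifier such as the fastText classifier mentioned in Figure 1.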

Theorems & Definitions (8)

  • Proposition 1
  • Theorem 1
  • Corollary 1
  • Theorem 2
  • Theorem 2
  • Corollary 1
  • Theorem 2
  • Theorem 3