Language Models Improve When Pretraining Data Matches Target Tasks

David Mizrahi, Anders Boesen Lindbo Larsen, Jesse Allardice, Suzie Petryk, Yuri Gorokhov, Jeffrey Li, Alex Fang, Josh Gardner, Tom Gunter, Afshin Dehghan

TL;DR

This work demonstrates that language model capabilities can be substantially improved by explicitly aligning pretraining data with target benchmarks using BETR. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores the sampled documents by their similarity to the benchmarks, and trains a lightweight classifier to extend those scores to the full corpus. It achieves consistent 1.8–2.8x compute multipliers over strong baselines across scales, and up to 4.7x over unfiltered data. BETR supports both specialist (Target-Core) and generalist (Target-Noncore) models, revealing trade-offs in capability coverage and illustrating how benchmark choices shape model behavior. Scaling-law analyses show that optimal data filtering becomes less aggressive as model scale increases, and that task-level benefits from data selection vary widely, underscoring the need for scale-aware data strategies. Overall, the results indicate that explicit benchmark-driven data selection is a practical and informative lever for shaping the capabilities of large language models, with clear implications for data curation practices and evaluation design.

Abstract

Every data selection method inherently has a target. In practice, these targets often emerge implicitly through benchmark-driven iteration: researchers develop selection strategies, train models, measure benchmark performance, then refine accordingly. This raises a natural question: what happens when we make this optimization explicit? To explore this, we propose benchmark-targeted ranking (BETR), a simple method that selects pretraining documents based on similarity to benchmark training examples. BETR embeds benchmark examples and a sample of pretraining documents in a shared space, scores this sample by similarity to benchmarks, then trains a lightweight classifier to predict these scores for the full corpus. We compare data selection methods by training over 500 models spanning $10^{19}$ to $10^{22}$ FLOPs and fitting scaling laws to them. From this, we find that simply aligning pretraining data to evaluation benchmarks using BETR achieves a 2.1x compute multiplier over DCLM-Baseline (4.7x over unfiltered data) and improves performance on 9 out of 10 tasks across all scales. BETR also generalizes well: when targeting a diverse set of benchmarks disjoint from our evaluation suite, it still matches or outperforms baselines. Our scaling analysis further reveals a clear trend: larger models require less aggressive filtering. Overall, our findings show that directly matching pretraining data to target tasks precisely shapes model capabilities and highlight that optimal selection strategies must adapt to model scale.
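The three steps described in the abstract (embed, score by similarity, train a predictor) can be illustrated with a minimal NumPy sketch. Everything below is an assumption made for illustration and is not the authors' implementation: the embeddings are random placeholders rather than outputs of a real embedding model, the best-rank aggregation follows the wording of the Figure 3 caption ("best rank assigned by any benchmark example"), the 10% keep fraction is arbitrary, and a ridge regression on the embeddings stands in for the lightweight classifier. The function names `best_rank_scores`, `top_fraction_labels`, `fit_linear_predictor`, and `predict_scores` are hypothetical.

```python
import numpy as np


def l2_normalize(x):
    """Scale each row to unit length so dot products equal cosine similarities."""
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def best_rank_scores(bench_emb, doc_emb):
    """For each document, return the best (lowest) rank given by any benchmark example.

    Rank 0 means the document is the most similar one for some benchmark example.
    """
    sims = l2_normalize(bench_emb) @ l2_normalize(doc_emb).T  # (n_bench, n_docs)
    order = np.argsort(-sims, axis=1)  # document indices, most similar first
    ranks = np.empty_like(order)
    rows = np.arange(sims.shape[0])[:, None]
    ranks[rows, order] = np.arange(sims.shape[1])[None, :]  # invert the permutation
    return ranks.min(axis=0)


def top_fraction_labels(best_rank, keep_fraction=0.1):
    """Label the best-ranked fraction of documents 1 and the rest 0 (assumed threshold)."""
    n_keep = max(1, int(round(keep_fraction * best_rank.size)))
    keep_idx = np.argsort(best_rank, kind="stable")[:n_keep]
    labels = np.zeros(best_rank.size, dtype=int)
    labels[keep_idx] = 1
    return labels


def fit_linear_predictor(features, labels, l2=1e-2):
    """Ridge regression with a bias term: a stand-in for the lightweight classifier."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    A = X.T @ X + l2 * np.eye(X.shape[1])
    return np.linalg.solve(A, X.T @ labels)


def predict_scores(features, weights):
    """Apply the fitted linear predictor to new document features."""
    X = np.hstack([features, np.ones((features.shape[0], 1))])
    return X @ weights


# Placeholder data: 5 "benchmark examples" and 200 "sampled documents" in 16 dimensions.
rng = np.random.default_rng(0)
bench_emb = rng.normal(size=(5, 16))
doc_emb = rng.normal(size=(200, 16))

best_rank = best_rank_scores(bench_emb, doc_emb)
labels = top_fraction_labels(best_rank, keep_fraction=0.1)
weights = fit_linear_predictor(doc_emb, labels)

# In a real pipeline the predictor would be applied to the rest of the corpus;
# here it is applied to the same sample purely to show the call.
predicted = predict_scores(doc_emb, weights)
```

The point of the final step is cost: similarity scoring needs embeddings for every scored document, so it is run only on a small sample, while the cheap predictor ranks the full pool.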

Paper Structure

This paper contains 53 sections, 5 equations, 26 figures, 13 tables.

Figures (26)

  • Figure 1: Benchmark-targeted ranking (BETR) achieves a 1.8x–2.8x compute multiplier over strong baselines. Scaling curves show accuracy on Core (10 standard benchmarks) at compute-optimality from $10^{19}$ to $10^{22}$ FLOPs. Target-Core directly optimizes for evaluated benchmarks, while Target-Noncore targets distinct benchmarks. Both outperform DCLM-Baseline at all scales.
  • Figure 2: BETR method overview. We embed benchmark examples and a small sample of pretraining documents ($\sim$0.1% of pool) in a shared space, score the sampled documents by their similarity to benchmarks, then train a classifier on these scores to efficiently rank and filter the entire document pool.
  • Figure 3: BETR scoring distributions in practice. Left: Best rank assigned by any benchmark example (BE). With >14,000 benchmark examples competing to score documents, only those ranked in the top 0.002% by some benchmark example reach the top 10% of BETR scores. Right: Cosine similarity to the benchmark example that assigned the best rank. Even top-ranked documents show only moderate similarities ($\sim$0.5) to benchmark examples, highlighting the limited overlap between benchmark examples and web text. Shown for our default (i.e., "in practice") Target-Core settings on DCLM-RefinedWeb.
  • Figure 3: Benchmark vs. domain targeting. Targeting diverse benchmarks (Noncore) outperforms targeting specific domains on held-out tasks (Core).
  • Figure 4: Comparing datasets using scaling laws. For each dataset, we use a two-stage scaling law approach to predict per-benchmark accuracy (gadre2024language, meta2024llama3, bhagia2024establishing). 1) Fit a loss scaling law as a function of model size and training tokens (color coding indicates training FLOPs, darker blues for more compute used). 2) Map task loss to accuracy for each benchmark. 3) Combine the two to predict accuracy at any configuration, including compute-optimal settings. 4) Repeat for each dataset to compare performance across compute budgets.
  • ...and 21 more figures