OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Shaobo Wang, Xuan Ouyang, Tianyi Xu, Yuzheng Hu, Jialin Liu, Guo Chen, Tianyu Zhang, Junhao Zheng, Kexin Yang, Xingzhang Ren, Dayiheng Liu, Linfeng Zhang

TL;DR

The paper addresses data efficiency in large language model pre-training under the Data Wall by proposing OPUS, a dynamic data-selection framework that aligns sampling with the optimizer-induced update geometry. OPUS defines an optimizer-induced utility, uses Bench-Proxy to obtain a stable in-distribution proxy, and combines Ghost gradients, CountSketch projections, and Boltzmann sampling to keep selection scalable and diverse at about 4.7% additional compute. Empirically, on GPT-2 Large/XL trained on FineWeb, OPUS improves average accuracy over random selection by 2.2% across 10 benchmarks and achieves an 8× reduction in computation on GPT-2 XL. In continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS with 0.5B tokens outperforms full training with 3B tokens. The work advances dynamic data selection by pairing optimizer-aware objectives with scalable estimation.

Abstract

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency, and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. Moreover, when combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Furthermore, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
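
The scoring-and-sampling idea in the abstract can be illustrated with a small NumPy sketch. This is not the authors' implementation; it is a toy under stated assumptions. The "effective update" is assumed to be an Adam-style diagonal rescaling of the raw gradient, per-sample gradients are materialized explicitly (the Ghost technique, which avoids this, is not reproduced here), and all function and parameter names (`make_countsketch`, `apply_countsketch`, `boltzmann_probs`, `select_batch`, `second_moment`, `temperature`) are hypothetical.

```python
import numpy as np

def make_countsketch(dim, sketch_dim, rng):
    # One random bucket and one random sign per input coordinate.
    buckets = rng.integers(0, sketch_dim, size=dim)
    signs = rng.choice(np.array([-1.0, 1.0]), size=dim)
    return buckets, signs

def apply_countsketch(x, buckets, signs, sketch_dim):
    # Map rows of x from (n, dim) to (n, sketch_dim) by signed bucket sums.
    x = np.atleast_2d(np.asarray(x, dtype=float))
    out = np.zeros((sketch_dim, x.shape[0]))
    # np.add.at accumulates correctly when several coordinates share a bucket.
    np.add.at(out, buckets, (x * signs).T)
    return out.T

def boltzmann_probs(utilities, temperature):
    # Softmax over utilities / temperature, shifted for numerical stability.
    z = np.asarray(utilities, dtype=float) / temperature
    w = np.exp(z - z.max())
    return w / w.sum()

def select_batch(cand_grads, target_dir, second_moment, k,
                 temperature, sketch_dim, rng, eps=1e-8):
    # ASSUMPTION: Adam-like diagonal preconditioning stands in for the
    # optimizer-induced update; the paper's exact form may differ.
    updates = cand_grads / (np.sqrt(second_moment) + eps)
    buckets, signs = make_countsketch(cand_grads.shape[1], sketch_dim, rng)
    sk_updates = apply_countsketch(updates, buckets, signs, sketch_dim)
    sk_target = apply_countsketch(target_dir, buckets, signs, sketch_dim)[0]
    # Utility: sketched inner product between each update and the target.
    utilities = sk_updates @ sk_target
    probs = boltzmann_probs(utilities, temperature)
    # Stochastic selection without replacement keeps some diversity.
    chosen = rng.choice(len(probs), size=k, replace=False, p=probs)
    return chosen, utilities, probs

# Toy usage with synthetic gradients (illustrative values only).
rng = np.random.default_rng(0)
grads = rng.normal(size=(32, 256))    # 32 candidates, 256 parameters
target = rng.normal(size=256)         # proxy-derived target direction
v = np.ones(256)                      # hypothetical second-moment estimate
chosen, utilities, probs = select_batch(
    grads, target, v, k=8, temperature=5.0, sketch_dim=64, rng=rng)
```

In this sketch a lower `temperature` concentrates the selection on the highest-utility candidates, while a higher one moves it toward uniform sampling; the sketched inner product approximates the full-dimensional one at a cost proportional to `sketch_dim` rather than the parameter count.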

Paper Structure (24 sections, 33 equations, 7 figures, 8 tables, 1 algorithm)

Figures (7)

  • Figure 1: OPUS outperforms random selection by an average of 2.2% accuracy across 10 benchmarks and achieves an 8× reduction in computation on GPT-2 XL using the FineWeb dataset.
  • Figure 2: Comparison of different data selection methods.
  • Figure 3: Overview of OPUS pipeline.
  • Figure 4: Validation-loss curves for GPT-2 XL and GPT-2 Large pre-trained from scratch on the FineWeb-Edu dataset. Left: Results on GPT-2 XL. OPUS is compared with representative baselines trained on the high-quality pool, with Random 60B shown as a non-compute-matched reference. Curves are shown up to 30B update tokens for compute-matched comparison. Right: Results on GPT-2 Large.
  • Figure 5: Continued pre-training results on SciencePedia.
  • ...and 2 more figures