OPUS: Towards Efficient and Principled Data Selection in Large Language Model Pre-training in Every Iteration

Shaobo Wang; Xuan Ouyang; Tianyi Xu; Yuzheng Hu; Jialin Liu; Guo Chen; Tianyu Zhang; Junhao Zheng; Kexin Yang; Xingzhang Ren; Dayiheng Liu; Linfeng Zhang

TL;DR

The paper addresses data efficiency in large language model pre-training under the Data Wall by proposing OPUS, a dynamic data-selection framework that aligns sampling with the optimizer-induced update geometry. OPUS defines an optimizer-induced utility, uses Bench-Proxy to obtain a stable in-distribution proxy, and combines Ghost gradients, CountSketch projections, and Boltzmann sampling to keep selection scalable and diverse at modest overhead. Empirically, OPUS yields compute-efficient gains for GPT-2 Large/XL on FineWeb (2.2% average accuracy improvement; up to 8× compute savings on GPT-2 XL). In continued pre-training of Qwen3-8B-Base on SciencePedia, it achieves superior performance with 0.5B tokens compared to full training with 3B tokens. The work advances dynamic data selection by pairing an optimizer-aware objective with scalable utility estimation.
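
The Boltzmann sampling step mentioned above can be illustrated with a small sketch. This is not the authors' implementation: the function name `boltzmann_sample`, the `temperature` parameter, and the choice to sample without replacement via the Gumbel-top-k trick are all assumptions made for illustration. The sketch only shows how temperature-scaled utility scores can drive a stochastic selection that still favors high-utility candidates.

```python
import math
import random


def boltzmann_sample(scores, k, temperature=1.0, seed=0):
    """Pick k distinct indices, favoring high scores (hypothetical helper).

    Equivalent to drawing without replacement from softmax(scores / temperature).
    """
    if not 0 < k <= len(scores):
        raise ValueError("k must be between 1 and len(scores)")
    if temperature <= 0:
        raise ValueError("temperature must be positive")
    rng = random.Random(seed)
    keys = []
    for index, score in enumerate(scores):
        # Gumbel(0, 1) noise; the clamp avoids log(0) if rng.random() returns 0.0.
        u = max(rng.random(), 1e-300)
        gumbel = -math.log(-math.log(u))
        keys.append((score / temperature + gumbel, index))
    # Gumbel-top-k: the k largest perturbed logits form the sample.
    keys.sort(reverse=True)
    return [index for _, index in keys[:k]]


# Example: choose 2 of 4 candidates from made-up utility scores.
selected = boltzmann_sample([0.1, 2.0, -1.0, 0.5], k=2, temperature=0.5)
```

A low temperature approaches greedy top-k selection, while a high temperature approaches uniform sampling, which is the lever such a scheme offers for trading utility against diversity.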

Abstract

As high-quality public text approaches exhaustion, a phenomenon known as the Data Wall, pre-training is shifting from more tokens to better tokens. However, existing methods either rely on heuristic static filters that ignore training dynamics, or use dynamic yet optimizer-agnostic criteria based on raw gradients. We propose OPUS (Optimizer-induced Projected Utility Selection), a dynamic data selection framework that defines utility in the optimizer-induced update space. OPUS scores candidates by projecting their effective updates, shaped by modern optimizers, onto a target direction derived from a stable, in-distribution proxy. To ensure scalability, we employ the Ghost technique with CountSketch for computational efficiency and Boltzmann sampling for data diversity, incurring only 4.7% additional compute overhead. OPUS achieves remarkable results across diverse corpora, quality tiers, optimizers, and model scales. In pre-training of GPT-2 Large/XL on FineWeb and FineWeb-Edu with 30B tokens, OPUS outperforms industrial-level baselines and even full 200B-token training. When combined with industrial-level static filters, OPUS further improves pre-training efficiency, even with lower-quality data. Finally, in continued pre-training of Qwen3-8B-Base on SciencePedia, OPUS achieves superior performance using only 0.5B tokens compared to full training with 3B tokens, demonstrating significant data efficiency gains in specialized domains.
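
To make the projection-based scoring concrete, here is a minimal sketch of estimating an inner product through a CountSketch. It is an illustration under stated assumptions, not the paper's code: the helpers `make_countsketch`, `sketch`, and `projected_utility` are hypothetical names, the vectors are plain Python lists, and any optimizer preconditioning is assumed to have been applied to `update` before sketching. The Ghost technique is not modeled here.

```python
import random


def make_countsketch(dim, sketch_dim, seed=0):
    """Draw one random bucket and one random sign per input coordinate."""
    rng = random.Random(seed)
    buckets = [rng.randrange(sketch_dim) for _ in range(dim)]
    signs = [rng.choice((-1.0, 1.0)) for _ in range(dim)]
    return buckets, signs


def sketch(vec, buckets, signs, sketch_dim):
    """Compress vec to sketch_dim entries by signed accumulation into buckets."""
    out = [0.0] * sketch_dim
    for value, bucket, sign in zip(vec, buckets, signs):
        out[bucket] += sign * value
    return out


def projected_utility(update, target, buckets, signs, sketch_dim):
    """Estimate <update, target> from the two sketches.

    Both vectors must share the same buckets and signs; the estimate is
    unbiased over the random choice of hash and sign functions.
    """
    s_update = sketch(update, buckets, signs, sketch_dim)
    s_target = sketch(target, buckets, signs, sketch_dim)
    return sum(a * b for a, b in zip(s_update, s_target))


# Example with made-up vectors: score one candidate against a target direction.
buckets, signs = make_countsketch(dim=8, sketch_dim=4, seed=0)
candidate_update = [0.5, -1.0, 0.0, 2.0, 0.0, 0.0, 1.5, -0.5]
target_direction = [1.0, 0.0, 0.0, 1.0, 0.0, 0.0, 1.0, 0.0]
score = projected_utility(candidate_update, target_direction, buckets, signs, 4)
```

Because the sketch is linear, the target direction can be sketched once and reused for every candidate, so each score costs work proportional to the sketch size rather than the full parameter dimension.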

Paper Structure (24 sections, 33 equations, 7 figures, 8 tables, 1 algorithm)

Introduction
Related Work
Background
LLM Pre-training
Data Selection in Pre-training
Modern Optimizers in Large-Scale Pre-training
Optimizer-induced Preconditioners
Stochastic gradient descent
Muon preconditioner
AdamW preconditioner
Methodology: OPUS
Optimizer-Induced Utility Objective
Scalable Utility Estimation
Boltzmann Sampling
Experiments
...and 9 more sections

Figures (7)

Figure 1: OPUS outperforms random selection by an average of 2.2% accuracy across 10 benchmarks and achieves an 8$\times$ reduction in computation on GPT-2 XL using the FineWeb dataset.
Figure 2: Comparison of different data selection methods.
Figure 3: Overview of OPUS pipeline.
Figure 4: Validation-loss curves for GPT-2 XL and GPT-2 Large pre-trained from scratch on the FineWeb-Edu dataset. Left: GPT-2 XL. OPUS is compared with representative baselines trained on the high-quality pool, with Random 60B shown as a non-compute-matched reference. Curves are shown up to 30B update tokens for compute-matched comparison. Right: GPT-2 Large.
Figure 5: Continued pre-training results on SciencePedia.
...and 2 more figures
