Webscale-RL: Automated Data Pipeline for Scaling RL Data to Pretraining Levels

Zhepeng Cen, Haolin Chen, Shiyu Wang, Zuxin Liu, Zhiwei Liu, Ding Zhao, Silvio Savarese, Caiming Xiong, Huan Wang, Weiran Yao

TL;DR

This work tackles the data bottleneck in reinforcement learning for large language models by introducing Webscale-RL, a scalable pipeline that converts web-scale pretraining corpora into millions of verifiable QA pairs for RL. The resulting Webscale-RL dataset comprises 1.2 million QA pairs across 9+ domains, enabling RL at near-pretraining scales. Empirical results show that RL training on Webscale-RL yields strong performance gains across diverse benchmarks and markedly better data efficiency, matching continual pretraining with as little as 1/100 of the tokens in some cases. The approach provides a viable path to scaling RL alongside pretraining, unlocking more capable and efficient language models while highlighting areas for future refinement such as domain balance and reward-model efficiency.

Abstract

Large Language Models (LLMs) have achieved remarkable success through imitation learning on vast text corpora, but this paradigm creates a training-generation gap and limits robust reasoning. Reinforcement learning (RL) offers a more data-efficient solution capable of bridging this gap, yet its application has been constrained by a critical data bottleneck: existing RL datasets are orders of magnitude smaller and less diverse than web-scale pre-training corpora. To address this, we introduce the Webscale-RL pipeline, a scalable data engine that systematically converts large-scale pre-training documents into millions of diverse, verifiable question-answer pairs for RL. Using this pipeline, we construct the Webscale-RL dataset, containing 1.2 million examples across more than 9 domains. Our experiments show that the model trained on this dataset significantly outperforms continual pretraining and strong data refinement baselines across a suite of benchmarks. Notably, RL training with our dataset proves substantially more efficient, achieving the performance of continual pre-training with up to 100$\times$ fewer tokens. Our work presents a viable path toward scaling RL to pre-training levels, enabling more capable and efficient language models.

Paper Structure

This paper contains 24 sections, 2 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Scaling RL for LLMs is fundamentally bottlenecked by the scarcity of high-quality RL data. While pretraining leverages $>$1T diverse web tokens, RL datasets remain limited to $<$10B tokens with limited diversity. We propose the Webscale-RL data pipeline to fundamentally improve the scalability of RL data: we convert pretraining corpora into verifiable query and ground-truth answer pairs, scaling RL data to pretraining levels while preserving diversity. The experiments show that RL with Webscale-RL data is significantly more effective and efficient than continual pretraining and data refinement baselines.
  • Figure 2: Overview of the Webscale-RL data pipeline, which systematically converts large-scale pretraining data into RL data while preserving the scale and diversity of web data. The pipeline maintains a domain-specific demonstration library of few-shot examples for high-quality generation and assigns multiple personas to each document so that the generated questions reflect different viewpoints. The generated QA pairs are verified for correctness and checked for leakage to ensure the reliability of the RL dataset.
  • Figure 3: Left: The domain distribution of the Webscale-RL dataset. Right: Comparison of the question embeddings of Webscale-RL and Nemotron data. We randomly sample 5K questions from each dataset and visualize the embeddings (from Qwen3-Embedding) reduced to 2D using UMAP.
  • Figure 4: Scaling comparison between Webscale-RL training and continual pretraining on the original pretraining corpora. We report performance on MMLU-pro (left), Big-Bench (middle), and the average over all benchmarks (right). The token count for RL training is calculated from the original pretraining corpus used to generate the Webscale-RL dataset. Each data point in the continual pretraining baselines is followed by SFT on the same 10K high-quality examples. RL training on Webscale-RL consistently outperforms continual pretraining at every training scale and exhibits better scaling efficiency.
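
The data flow described in the Figure 2 caption (few-shot demonstrations chosen by domain, several personas per document, then verification for correctness and leakage) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `DEMO_LIBRARY`, `PERSONAS`, `build_prompt`, `is_grounded`, `leaks_answer`, `convert_document`, and `toy_generator` are all hypothetical names, the generator is a fixed stub standing in for an LLM call, and the two string checks are crude stand-ins for whatever verifier the pipeline actually uses. Treating "leakage" as "the question reveals its own answer" is also an assumption made for this sketch.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Sequence, Tuple


@dataclass
class QAPair:
    question: str
    answer: str
    domain: str
    persona: str


# Hypothetical demonstration library: domain -> few-shot examples.
DEMO_LIBRARY: Dict[str, List[str]] = {
    "science": ["Q: Which gas do plants absorb? A: carbon dioxide"],
    "general": ["Q: What is the capital of France? A: Paris"],
}

# Hypothetical personas; each document is paired with every one of them.
PERSONAS: Tuple[str, ...] = ("student", "domain expert")


def build_prompt(document: str, domain: str, persona: str) -> str:
    """Assemble a generation prompt from persona, domain demos, and document."""
    demos = DEMO_LIBRARY.get(domain, DEMO_LIBRARY["general"])
    return "\n".join([f"Persona: {persona}", "Examples:", *demos, "Document:", document])


def is_grounded(qa: QAPair, document: str) -> bool:
    """Crude correctness stand-in: the answer must occur in the source document."""
    return qa.answer.strip().lower() in document.lower()


def leaks_answer(qa: QAPair) -> bool:
    """Crude leakage stand-in: the question must not contain its own answer."""
    return qa.answer.strip().lower() in qa.question.lower()


def convert_document(
    document: str,
    domain: str,
    generate: Callable[[str], List[Tuple[str, str]]],
    personas: Sequence[str] = PERSONAS,
) -> List[QAPair]:
    """Generate candidate QA pairs per persona and keep only verified ones."""
    kept: List[QAPair] = []
    for persona in personas:
        prompt = build_prompt(document, domain, persona)
        for question, answer in generate(prompt):
            qa = QAPair(question, answer, domain, persona)
            if is_grounded(qa, document) and not leaks_answer(qa):
                kept.append(qa)
    return kept


def toy_generator(prompt: str) -> List[Tuple[str, str]]:
    """Stub for an LLM call: returns the same three candidates for any prompt."""
    return [
        # Grounded in the document and does not reveal the answer: kept.
        ("At what temperature does water boil at sea level?", "100 degrees Celsius"),
        # The question contains the answer: rejected by the leakage check.
        ("Water boils at 100 degrees Celsius at what altitude?", "100 degrees Celsius"),
        # The answer is not supported by the document: rejected as ungrounded.
        ("What is the freezing point of mercury?", "-39 degrees Celsius"),
    ]


doc = "Water boils at 100 degrees Celsius at sea level."
pairs = convert_document(doc, "science", toy_generator)
for pair in pairs:
    print(pair.persona, "|", pair.question, "->", pair.answer)
```

In this toy run only the first candidate survives for each persona, so one source document yields one verified pair per persona. In a real setting the stub would be replaced by a model call, and the verification step would presumably need something stronger than substring matching.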