Reinforcement Pre-Training
Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei
TL;DR
Reinforcement Pre-Training (RPT) reframes next-token prediction as a reasoning task trained with reinforcement learning on vast unannotated text, using verifiable rewards derived from the ground-truth next token. Under its next-token reasoning objective, the model generates a chain-of-thought before committing to a final token prediction, and it is trained on-policy with a prefix-matching reward, which makes RL pre-training scalable. Empirical results show improved next-token prediction accuracy, favorable scaling with training compute, and stronger zero-shot performance on math and general knowledge benchmarks; RPT also provides a stronger foundation for subsequent RL fine-tuning. Overall, RPT offers a scalable, general-purpose RL pre-training paradigm that mitigates reward hacking and fosters deeper token-level reasoning, with the potential to advance generalist large language models.
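The summary above names a prefix-matching reward but does not define it. The following is a minimal sketch of one plausible reading, assuming the reward is 1 when the predicted text is a byte-level prefix of the ground-truth continuation and ends exactly on a ground-truth token boundary, and 0 otherwise. The function name `prefix_matching_reward` and its arguments are hypothetical and are not taken from the paper.

```python
from itertools import accumulate


def prefix_matching_reward(prediction: str, continuation_tokens: list[str]) -> float:
    """Binary reward for a predicted continuation (illustrative assumption only).

    prediction: the final answer extracted after the model's chain-of-thought.
    continuation_tokens: ground-truth tokens that follow the context, in order.
    """
    pred = prediction.encode("utf-8")
    if not pred:
        return 0.0  # an empty prediction earns nothing

    token_bytes = [tok.encode("utf-8") for tok in continuation_tokens]
    continuation = b"".join(token_bytes)

    # Cumulative byte lengths mark where each ground-truth token ends.
    boundaries = set(accumulate(len(tok) for tok in token_bytes))

    is_prefix = continuation.startswith(pred)
    ends_on_boundary = len(pred) in boundaries
    return 1.0 if is_prefix and ends_on_boundary else 0.0


if __name__ == "__main__":
    truth = [" cat", " sat"]
    for guess in [" cat", " cat sat", " ca", " dog"]:
        print(repr(guess), prefix_matching_reward(guess, truth))
```

Under this assumption the reward is verifiable directly from the corpus: it needs only the text that follows the context, with no human annotation or learned reward model. That is consistent with the claim that the approach scales to unannotated text.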
Abstract
In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, in which the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves next-token prediction accuracy in language modeling. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. These results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
