Reinforcement Pre-Training

Qingxiu Dong, Li Dong, Yao Tang, Tianzhu Ye, Yutao Sun, Zhifang Sui, Furu Wei

TL;DR

Reinforcement Pre-Training (RPT) reframes next-token prediction as a reasoning task trained with reinforcement learning on vast unannotated text using verifiable rewards derived from the ground-truth next token. It introduces a next-token reasoning objective that outputs a chain-of-thought before the final token and trains on-policy with a prefix-matching reward, enabling scalable RL pre-training. Empirical results show improved next-token prediction accuracy, favorable scaling with training compute, and stronger zero-shot performance on math and general knowledge benchmarks, while also providing a stronger foundation for subsequent RL fine-tuning. RPT offers a scalable, general-purpose RL pre-training paradigm that mitigates reward hacking and fosters deeper token-level reasoning, with potential to advance generalist large language models.

Abstract

In this work, we introduce Reinforcement Pre-Training (RPT) as a new scaling paradigm for large language models and reinforcement learning (RL). Specifically, we reframe next-token prediction as a reasoning task trained using RL, where the model receives verifiable rewards for correctly predicting the next token for a given context. RPT offers a scalable method to leverage vast amounts of text data for general-purpose RL, rather than relying on domain-specific annotated answers. By incentivizing the capability of next-token reasoning, RPT significantly improves next-token prediction accuracy. Moreover, RPT provides a strong pre-trained foundation for further reinforcement fine-tuning. The scaling curves show that increased training compute consistently improves next-token prediction accuracy. The results position RPT as an effective and promising scaling paradigm to advance language model pre-training.
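
The TL;DR names a prefix-matching reward but this page does not spell it out. The sketch below is one plausible reading, not the authors' implementation: a prediction earns a reward of 1 when its bytes equal a prefix of the ground-truth continuation that ends on a token boundary, and 0 otherwise. The function name `prefix_matching_reward` and the representation of the continuation as a list of token strings are assumptions made for illustration.

```python
def prefix_matching_reward(prediction: str, continuation_tokens: list[str]) -> float:
    """Hypothetical verifiable reward for next-token reasoning.

    Returns 1.0 if `prediction` is byte-identical to a prefix of the
    ground-truth continuation that ends exactly on a token boundary,
    and 0.0 otherwise. This is an assumed reading of "prefix matching".
    """
    pred_bytes = prediction.encode("utf-8")
    if not pred_bytes:
        return 0.0  # an empty prediction is never rewarded

    prefix = b""
    for token in continuation_tokens:
        prefix += token.encode("utf-8")
        if prefix == pred_bytes:
            return 1.0  # matched a prefix that ends on a token boundary
        if len(prefix) > len(pred_bytes):
            break  # longer prefixes can no longer match
    return 0.0


# Ground-truth continuation, already split into tokens (illustrative data).
continuation = [" the", " cat"]
one_token = prefix_matching_reward(" the", continuation)
two_tokens = prefix_matching_reward(" the cat", continuation)
partial = prefix_matching_reward(" th", continuation)
wrong = prefix_matching_reward(" dog", continuation)
```

Because the reward is computed from the corpus itself, no human annotation or learned reward model is needed, which is what allows the objective to be applied to unannotated text.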

Paper Structure

This paper contains 27 sections, 5 equations, 7 figures, 11 tables.

Figures (7)

  • Figure 1: Reinforcement pre-training (RPT) reframes next-token prediction as a reasoning task, where the language model is incentivized via reinforcement learning (RL) to reason about and correctly predict the next token. The proposed approach allows RL to be scaled to the web-text corpus. The image of the cherry-on-top cake is taken from LeCun's slides [lecun:cake].
  • Figure 1: Next-token prediction accuracy across three test splits of varying difficulty. RPT outperforms both the standard next-token prediction baselines and the reasoning-based prediction baseline.
  • Figure 2: Comparison of standard next-token prediction and next-token reasoning. Standard next-token prediction estimates the next token in the pre-training corpus directly, while next-token reasoning performs reasoning over multiple tokens before making the prediction.
  • Figure 3: An illustration of reinforcement pre-training. Given a context with a missing continuation, the LLM performs on-policy rollouts to generate $G$ different thinking trajectories. Each includes an intermediate reasoning step and a final prediction for the next token. A positive reward is assigned if the prediction matches the ground-truth token; otherwise, the reward is zero. This reward signal is used to update the LLM, encouraging trajectories that lead to accurate continuations.
  • Figure 4: Average next-token prediction accuracy across data of different difficulty levels. R1-Qwen-14B/32B denote R1-Distill-Qwen-14B/32B, respectively.
  • ...and 2 more figures
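
The rollout-and-reward step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's training code: the `Trajectory` class, the `score_rollouts` function, and the hand-written rollouts are hypothetical stand-ins for samples drawn from the LLM, exact string equality stands in for the reward check, and the policy update itself is omitted.

```python
from dataclasses import dataclass


@dataclass
class Trajectory:
    """One on-policy rollout: intermediate reasoning plus a final prediction."""
    reasoning: str
    prediction: str


def score_rollouts(trajectories: list[Trajectory], ground_truth: str) -> list[float]:
    """Assign reward 1.0 to rollouts whose prediction matches the ground-truth
    next token and 0.0 to all others (exact match assumed for simplicity)."""
    return [1.0 if t.prediction == ground_truth else 0.0 for t in trajectories]


# G = 4 hand-written rollouts for the context "Force equals" (illustrative data).
rollouts = [
    Trajectory("This looks like Newton's second law, F = m * a.", " mass"),
    Trajectory("The sentence may continue with a definition.", " the"),
    Trajectory("Physics context, so the product of mass and acceleration.", " mass"),
    Trajectory("Could be describing a unit.", " one"),
]

rewards = score_rollouts(rollouts, ground_truth=" mass")
fraction_correct = sum(rewards) / len(rewards)
```

In an actual training loop these per-trajectory rewards would feed an RL update that raises the probability of the rewarded trajectories; which update rule is used is not stated on this page.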