Learning to Reason as Action Abstractions with Scalable Mid-Training RL

Shenao Zhang, Donghan Yu, Yihao Feng, Bowen Jin, Zhaoran Wang, John Peebles, Zirui Wang

TL;DR

This work analyzes how mid-training shapes post-training RL in large language models by identifying a compact, temporally extended action subspace that improves pruning efficiency and accelerates RL convergence. It introduces RA3, a scalable mid-training algorithm based on a temporal variational bound that learns temporally consistent latent action abstractions and bootstraps data for fine-tuning. Empirical results on code-generation benchmarks show RA3 enhancing both pre-RL performance and post-training RLVR, with faster convergence and higher asymptotic performance across multiple datasets and base models. The approach highlights the importance of action abstractions over primitive actions for scalable, effective RL-enabled LLMs in real-world tasks.

Abstract

Large language models excel with reinforcement learning (RL), but fully unlocking this potential requires a mid-training stage. An effective mid-training phase should identify a compact set of useful actions and enable fast selection among them through online RL. We formalize this intuition by presenting the first theoretical result on how mid-training shapes post-training: it characterizes an action subspace that minimizes both the value approximation error from pruning and the RL error during subsequent planning. Our analysis reveals two key determinants of mid-training effectiveness: pruning efficiency, which shapes the prior of the initial RL policy, and its impact on RL convergence, which governs the extent to which that policy can be improved via online interactions. These results suggest that mid-training is most effective when the decision space is compact and the effective horizon is short, highlighting the importance of operating in the space of action abstractions rather than primitive actions. Building on these insights, we propose Reasoning as Action Abstractions (RA3), a scalable mid-training algorithm. Specifically, we derive a sequential variational lower bound and optimize it by iteratively discovering temporally-consistent latent structures via RL, followed by fine-tuning on the bootstrapped data. Experiments on code generation tasks demonstrate the effectiveness of our approach. Across multiple base models, RA3 improves the average performance on HumanEval and MBPP by 8 and 4 points over the base model and the next-token prediction baseline. Furthermore, RA3 achieves faster convergence and higher asymptotic performance in RLVR on HumanEval+, MBPP+, LiveCodeBench, and Codeforces.
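As a rough illustration of why a compact decision space and a short effective horizon matter, the sketch below counts how many distinct decision sequences exist when acting token by token versus choosing among a small set of temporally extended abstractions. This is back-of-the-envelope arithmetic, not the paper's error bound, and every number in it (vocabulary size, solution length, number of abstractions, tokens per abstraction) is a hypothetical value chosen for illustration.

```python
import math


def log10_num_sequences(num_choices: int, num_steps: int) -> float:
    """Return log10 of num_choices ** num_steps, the count of distinct decision sequences."""
    return num_steps * math.log10(num_choices)


# All numbers below are hypothetical, chosen only for illustration.
VOCAB_SIZE = 50_000          # primitive actions: one token per step
NUM_TOKENS = 200             # primitive horizon: tokens in one solution
NUM_ABSTRACTIONS = 64        # assumed size of a pruned set of abstractions
TOKENS_PER_ABSTRACTION = 20  # assumed number of tokens covered by one abstraction

primitive_log10 = log10_num_sequences(VOCAB_SIZE, NUM_TOKENS)
abstract_log10 = log10_num_sequences(
    NUM_ABSTRACTIONS, NUM_TOKENS // TOKENS_PER_ABSTRACTION
)

print(f"primitive actions:   ~10^{primitive_log10:.0f} sequences")
print(f"action abstractions: ~10^{abstract_log10:.0f} sequences")
```

Under these assumed numbers, both the per-step branching factor and the number of decisions shrink, so the space that online RL has to search is smaller by many orders of magnitude.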

Paper Structure

This paper contains 24 sections, 2 theorems, 20 equations, 6 figures, 1 table, 1 algorithm.

Key Result

Theorem 3.1

The next-token prediction objective in eq_ntp admits a variational lower bound (the temporal ELBO), where $p(z_t| s_t, z_{t-1})$ is the prior distribution of the latent $z_t$.
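For orientation, a generic temporal ELBO of this kind can be sketched as follows. This is a standard variational derivation under assumed notation, not necessarily the exact statement of Theorem 3.1. The assumptions are: a latent-conditioned policy $\pi(a_t| s_t, z_t)$, a variational distribution $q(z_t| s_t, z_{t-1})$, a fixed null value for $z_{-1}$, and a state $s_{t+1}$ that is determined by $(s_t, a_t)$.

```latex
% Assumed factorization of the latent-variable model (hypothetical notation):
%   p(a_{0:T}, z_{0:T} | s_0) = \prod_t p(z_t | s_t, z_{t-1}) \, \pi(a_t | s_t, z_t)
\begin{align}
\log p(a_{0:T} \mid s_0)
  &= \log \mathbb{E}_{q(z_{0:T})}\!\left[
       \prod_{t=0}^{T}
       \frac{p(z_t \mid s_t, z_{t-1})\, \pi(a_t \mid s_t, z_t)}
            {q(z_t \mid s_t, z_{t-1})}
     \right] \\
  &\ge \mathbb{E}_{q(z_{0:T})}\!\left[
       \sum_{t=0}^{T} \Big(
         \log \pi(a_t \mid s_t, z_t)
         - D_{\mathrm{KL}}\big(
             q(\cdot \mid s_t, z_{t-1}) \,\big\|\, p(\cdot \mid s_t, z_{t-1})
           \big)
       \Big)
     \right]
\end{align}
```

The inequality is Jensen's inequality applied to the logarithm. The KL term keeps the inferred latents close to the prior $p(z_t| s_t, z_{t-1})$, which is the only quantity in this sketch that the statement above names explicitly.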

Figures (6)

  • Figure 1: RL training reward curve.
  • Figure 2: Examples of the data from mid-training and after reasoning bootstrapping, where transferable skills, such as dummy head creation and BFS, are abstracted and incorporated into the data.
  • Figure 3: Data bootstrapped with reasoning learned in the E step reduces the CE loss during the M step fine-tuning.
  • Figure 4: Evaluation results during mid-training, with accuracies averaged across four benchmarks.
  • Figure 5: RLVR evaluation results (mean and standard error across independent runs) of different mid-training algorithms.
  • ...and 1 more figure

Theorems & Definitions (3)

  • Theorem 3.1: Temporal ELBO
  • Proposition 4.1
  • Definition A.1: Suboptimal Action Subset