On the Interplay of Pre-Training, Mid-Training, and RL on Reasoning Language Models

Charlie Zhang, Graham Neubig, Xiang Yue

TL;DR

The paper presents a fully controlled framework to dissect how pre-training, mid-training, and RL-based post-training jointly shape reasoning capabilities in language models, using synthetic, parseable reasoning tasks with explicit traces. It demonstrates that true extrapolative gains from RL occur only when pre-training leaves headroom and RL data sit at the model’s edge of competence, while contextual transfer requires minimal yet sufficient pre-training exposure to the target context. Under fixed compute, mid-training improves generalization over RL alone by installing priors that RL can leverage, and process-level supervision reduces reward hacking, improving both accuracy and reasoning fidelity. Together, these findings offer concrete guidance on data curricula, reward design, and compute budgeting to improve reasoning in language models.

Abstract

Recent reinforcement learning (RL) techniques have yielded impressive reasoning improvements in language models, yet it remains unclear whether post-training truly extends a model's reasoning ability beyond what it acquires during pre-training. A central challenge is the lack of control in modern training pipelines: large-scale pre-training corpora are opaque, mid-training is often underexamined, and RL objectives interact with unknown prior knowledge in complex ways. To resolve this ambiguity, we develop a fully controlled experimental framework that isolates the causal contributions of pre-training, mid-training, and RL-based post-training. Our approach employs synthetic reasoning tasks with explicit atomic operations, parseable step-by-step reasoning traces, and systematic manipulation of training distributions. We evaluate models along two axes: extrapolative generalization to more complex compositions and contextual generalization across surface contexts. Using this framework, we reconcile competing views on RL's effectiveness. We show that: 1) RL produces true capability gains (pass@128) only when pre-training leaves sufficient headroom and when RL data target the model's edge of competence, tasks at the boundary that are difficult but not yet out of reach. 2) Contextual generalization requires minimal yet sufficient pre-training exposure, after which RL can reliably transfer. 3) Mid-training significantly enhances performance under fixed compute compared with RL only, demonstrating its central but underexplored role in training pipelines. 4) Process-level rewards reduce reward hacking and improve reasoning fidelity. Together, these results clarify the interplay between pre-training, mid-training, and RL, offering a foundation for understanding and improving reasoning LM training strategies.
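The abstract reports capability gains in terms of pass@128. As a minimal sketch of how such a metric is commonly computed, the block below uses the standard unbiased pass@k estimator, 1 - C(n-c, k)/C(n, k), for n sampled solutions of which c are correct. Whether the paper uses this exact estimator is an assumption, and the function name `pass_at_k` is illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Unbiased pass@k estimate for one problem.

    n: number of sampled solutions
    c: number of those samples that are correct
    k: budget of attempts being evaluated (k <= n)
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k incorrect samples: every size-k subset contains a correct one.
    if n - c < k:
        return 1.0
    # Probability that a random size-k subset contains no correct sample.
    all_wrong = comb(n - c, k) / comb(n, k)
    return 1.0 - all_wrong


if __name__ == "__main__":
    # With 128 samples and k=128, one correct sample is enough for pass@128 = 1.
    print(pass_at_k(n=128, c=1, k=128))
    print(pass_at_k(n=128, c=0, k=128))
```

Averaging this quantity over all problems in a split gives the split-level pass@k. At k = n the estimate reduces to an indicator of whether any sample was correct.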

Paper Structure

This paper contains 37 sections, 33 equations, 18 figures, 3 tables.

Figures (18)

  • Figure 1: Interplay of pre-, mid-, and post-training in LM reasoning. Left: RL yields genuine extrapolative gains only when task difficulty slightly exceeds the pre-training range; gains vanish when tasks are already covered or too out-of-distribution (up to +42% pass@128 when well-calibrated). Middle: Contextual generalization requires minimal yet sufficient pre-training exposure to long-tail contexts. RL fails with near-zero exposure but generalizes robustly with sparse exposure ($\ge$1%), yielding up to +60% pass@128. Right: A mid-training stage bridging pre-training and RL substantially improves OOD reasoning under fixed compute, with mid-training + RL outperforming RL alone by +10.8% on OOD-hard tasks.
  • Figure 2: Overview of the data generation framework, task setup, and process-verified evaluation. The figure depicts the dependency graph $\mathcal{G}$ and contextual templates $\tau$, the task setup for extrapolative and contextual generalization, and the process-verified evaluation framework that checks for correctness of reasoning steps.
  • Figure 3: pass@k performance on three tasks: ID (op=2-10), OOD-edge (op=11-14), OOD-hard (op=15-20). RL is applied to four different data regimes (colors). RL on ID tasks never improves beyond the base model at pass@128. RL consistently improves pass@128 on harder tasks when applied beyond the base model's capacity.
  • Figure 4: pass@128 performance on context B after post-training on a 50% context A + 50% context B mixture. Different lines represent levels of pre-training exposure to long-tailed context B atomic op=2 examples. RL incentivizes contextual generalization when the model has minimal exposure ($\geq$1%) to context B in pre-training.
  • Figure 5: Distribution of topological similarity between generated correct context B and gold context A graphs.
  • ...and 13 more figures
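
Figure 2 describes a process-verified evaluation that checks the correctness of reasoning steps, not only the final answer. The sketch below illustrates the idea under simplifying assumptions: a parsed trace is represented as a hypothetical mapping from node names in the dependency graph to computed values, and a sample counts as correct only if the final answer and every intermediate value match the gold ones. The data layout and function names are illustrative, not the paper's implementation.

```python
from typing import Dict


def outcome_correct(pred_answer: int, gold_answer: int) -> bool:
    """Outcome-only check: looks at the final answer and nothing else."""
    return pred_answer == gold_answer


def process_verified(
    pred_steps: Dict[str, int],
    pred_answer: int,
    gold_steps: Dict[str, int],
    gold_answer: int,
) -> bool:
    """Process-level check (assumed form): the final answer must match and
    every gold intermediate node must appear in the trace with the gold value."""
    if pred_answer != gold_answer:
        return False
    # A missing node yields None from .get(), which never equals a gold value.
    return all(pred_steps.get(node) == value for node, value in gold_steps.items())


if __name__ == "__main__":
    # Hypothetical three-node problem: a = 3, b = 2 * a, c = a + b.
    gold_steps = {"a": 3, "b": 6, "c": 9}
    gold_answer = 9

    # A trace that reaches the right answer through a wrong intermediate step.
    flawed_steps = {"a": 3, "b": 5, "c": 9}

    print(outcome_correct(9, gold_answer))                            # True
    print(process_verified(flawed_steps, 9, gold_steps, gold_answer))  # False
```

The flawed trace passes the outcome-only check but fails the process-level one, which is the kind of mismatch a process-level reward is meant to penalize.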