Behavior Injection: Preparing Language Models for Reinforcement Learning

Zhepeng Cen, Yihang Yao, William Han, Zuxin Liu, Ding Zhao

TL;DR

This work investigates why RL finetuning of LLMs yields inconsistent gains and identifies rollout accuracy distribution and data co-influence as key drivers. It introduces BRIDGE, a data augmentation pipeline that injects exploration and exploitation behaviors into SFT data to precondition models for RL, operationalized via a DAG-based reasoning framework. Empirical results on iGSM and PromptBench show BRIDGE consistently enhances RL gains across multiple base models, outperforming baselines and ablations. The findings offer a practical, data-centric strategy to improve RL efficiency and performance in reasoning tasks, with broader implications for data curation and behavioral augmentation in language models.

Abstract

Reinforcement learning (RL) has emerged as a powerful post-training technique to incentivize the reasoning ability of large language models (LLMs). However, LLMs can respond very inconsistently to RL finetuning: some show substantial performance gains, while others plateau or even degrade. To understand this divergence, we analyze the per-step influence of the RL objective and identify two key conditions for effective post-training: (1) RL-informative rollout accuracy, and (2) strong data co-influence, which quantifies how much the training data affects performance on other samples. Guided by these insights, we propose behavior injection, a task-agnostic data augmentation scheme applied prior to RL. Behavior injection enriches the supervised finetuning (SFT) data by seeding exploratory and exploitative behaviors, effectively making the model more RL-ready. We evaluate our method across two reasoning benchmarks with multiple base models. The results demonstrate that our theoretically motivated augmentation can significantly increase the performance gain from RL over the pre-RL model.

Paper Structure

This paper contains 25 sections, 1 theorem, 13 equations, 11 figures, 4 tables, 1 algorithm.

Key Result

Proposition 3.1

Suppose there are $n$ correct outputs in a sampled group of size $N$, and denote the correct and incorrect outputs by $\{{\mathbf{o}}_{i+}\}_{i=1}^n$ and $\{{\mathbf{o}}_{j-}\}_{j=1}^{N-n}$, respectively. When RL training on ${\mathbf{q}}$ is strictly on policy, with a sufficiently small $\beta$, the where $\alpha=n/N$ indicates the accuracy rate for the rollout samples, ${\mathcal{K}}_\theta(({\ma

Figures (11)

  • Figure 1: Overview of the BRIDGE pipeline: We augment the SFT data by introducing exploration and exploitation behaviors to prepare LLMs for RL finetuning.
  • Figure 2: DAG representation of behaviors.
  • Figure 3: Left: training curve. Middle: SFT model rollout accuracy distribution. Right: per-step influence visualization. The top and bottom plots correspond to the results of Qwen2.5-1B and Llama3.2-1B, respectively. In the right plots, we group the approximated per-step influence by sample accuracy; the influence of samples whose answers are all correct or all wrong is $0$.
  • Figure 4: Ablation of models with different behaviors on the iGSM task. We report validation accuracy curves as the measure of RL finetuning performance, where the validation set consists of 500 queries with $21\sim 25$ operations. We also compare the average per-step influence of the SFT models with different behaviors.
  • Figure 5: The ablations on behavior injection probability.
  • ...and 6 more figures
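
The caption of Figure 3 states that samples whose rollouts are all correct or all wrong have zero per-step influence. The following is a minimal sketch of why this can happen, assuming binary rewards and a GRPO-style group-normalized advantage. That estimator is an assumption here, since the excerpt above does not spell it out, and the names `group_advantages` and `rollout_accuracy` are illustrative rather than taken from the paper.

```python
from statistics import fmean, pstdev

def rollout_accuracy(rewards):
    """alpha = n / N: fraction of correct rollouts in one sampled group."""
    return sum(rewards) / len(rewards)

def group_advantages(rewards, eps=1e-8):
    """Group-normalized advantages (r - mean) / (std + eps).

    ASSUMPTION: a GRPO-style estimator with binary 0/1 rewards;
    the source text does not state which estimator is used.
    """
    mu = fmean(rewards)
    sigma = pstdev(rewards)  # population std; 0.0 when all rewards agree
    return [(r - mu) / (sigma + eps) for r in rewards]

# Sweep the number of correct rollouts n in a group of size N.
N = 4
for n in range(N + 1):
    rewards = [1] * n + [0] * (N - n)
    adv = group_advantages(rewards)
    print(f"alpha={rollout_accuracy(rewards):.2f}",
          [round(a, 3) for a in adv])
```

Under this assumption, only queries with $0 < \alpha < 1$ produce nonzero advantages, which is one way to read the "RL-informative rollout accuracy" condition from the abstract.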

Theorems & Definitions (4)

  • Proposition 3.1
  • proof
  • Remark A.1: Discussions on the assumptions
  • Remark A.2: Relation of data co-influence to neural tangent kernel (NTK)
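
The abstract describes data co-influence as how much training on one sample affects performance on other samples, and Remark A.2 relates it to the NTK. As a rough illustration only, the sketch below uses a toy logistic model and takes co-influence to be the inner product of log-likelihood gradients. Both the model and this kernel form are assumptions made for the example, not the paper's definition of ${\mathcal{K}}_\theta$; all function names are hypothetical.

```python
import math

def dot(u, v):
    return sum(ui * vi for ui, vi in zip(u, v))

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def log_prob(w, x, y):
    """log p(y | x) for a logistic model p(y=1 | x) = sigmoid(w . x)."""
    p = sigmoid(dot(w, x))
    return math.log(p if y == 1 else 1.0 - p)

def grad_log_prob(w, x, y):
    """Gradient of log p(y | x) with respect to w: (y - p) * x."""
    p = sigmoid(dot(w, x))
    return [(y - p) * xi for xi in x]

def co_influence(w, train, other):
    """ASSUMED kernel: inner product of the two log-likelihood gradients."""
    return dot(grad_log_prob(w, *train), grad_log_prob(w, *other))

def first_order_check(w, train, other, lr=1e-3):
    """Compare lr * kernel with the actual change in log p(other)
    after one gradient-ascent step on the training sample."""
    g = grad_log_prob(w, *train)
    w_new = [wi + lr * gi for wi, gi in zip(w, g)]
    actual = log_prob(w_new, *other) - log_prob(w, *other)
    predicted = lr * co_influence(w, train, other)
    return predicted, actual

w = [0.5, -0.3]
train = ([1.0, 2.0], 1)
similar = ([0.8, 1.5], 1)       # features close to the training sample
orthogonal = ([-1.0, 0.5], 1)   # features orthogonal to the training sample
for name, other in [("similar", similar), ("orthogonal", orthogonal)]:
    predicted, actual = first_order_check(w, train, other)
    print(name, predicted, actual)
```

In this toy setting, a step on the training sample raises the log-likelihood of the similar sample and leaves the orthogonal one essentially unchanged, which is the kind of cross-sample effect that "strong data co-influence" refers to.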