Learning from the Right Rollouts: Data Attribution for PPO-based LLM Post-Training

Dong Shu, Denghui Zhang, Jessica Hullman

Abstract

Traditional RL algorithms such as Proximal Policy Optimization (PPO) typically train on the entire rollout buffer, operating under the assumption that every generated episode provides a beneficial optimization signal. However, these episodes frequently contain noisy or unfaithful reasoning, which can degrade model performance and slow down training. In this paper, we propose Influence-Guided PPO (I-PPO), a novel framework that integrates data attribution into the RL post-training loop. By computing an influence score for each episode via a gradient-based approximation, I-PPO identifies and removes episodes whose gradients are anti-aligned with a validation gradient. Our experiments demonstrate that I-PPO consistently outperforms supervised fine-tuning (SFT) and PPO baselines. We further show that the filtering process acts as an intrinsic early-stopping mechanism, improving training efficiency while effectively reducing unfaithful chain-of-thought (CoT) reasoning.
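The filtering idea described in the abstract can be made concrete with a first-order influence estimate: score each episode by the dot product between its loss gradient and the gradient of a held-out validation loss, then drop negatively scored episodes before the PPO update. The sketch below is a minimal illustration under that assumption; the names `policy`, `episode_losses`, and `val_loss` are hypothetical, and the paper's exact estimator may differ.

```python
import torch

def influence_scores(policy, episode_losses, val_loss):
    """First-order influence estimate for each rollout episode.

    Scores each episode by the dot product between its policy-loss
    gradient and the validation-loss gradient. A negative score means
    a gradient step on that episode is expected to increase the
    validation loss, i.e., the episode is anti-aligned with the
    validation gradient.
    """
    params = [p for p in policy.parameters() if p.requires_grad]
    # Validation gradient g_val = dL_val / dtheta.
    g_val = torch.autograd.grad(val_loss, params, retain_graph=True)
    scores = []
    for loss in episode_losses:
        # Per-episode gradient g_ep = dL_ep / dtheta.
        g_ep = torch.autograd.grad(loss, params, retain_graph=True)
        # Influence score = <g_ep, g_val>, summed over all parameters.
        score = sum((ge * gv).sum() for ge, gv in zip(g_ep, g_val))
        scores.append(score.item())
    return scores

def filter_rollout_buffer(episodes, scores):
    """Keep only episodes with non-negative influence scores, then
    hand the filtered buffer to the usual PPO update."""
    return [ep for ep, s in zip(episodes, scores) if s >= 0]
```

Computed naively, this costs one extra backward pass per episode; practical attribution methods typically use cheaper proxies (e.g., gradients of only the last layer or a low-rank projection), which is the likely trade-off any real implementation would make.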

Figures (8)

  • Figure 1: Overview of I-PPO Framework. (a) Traditional PPO uses the raw rollout buffer for policy updates. (b) I-PPO refines the rollout buffer by calculating an influence score for each generated episode; episodes with negative scores are removed.
  • Figure 2: Cost Analysis per Training Step. Comparisons are conducted using the Rho model on (a) GSM8K dataset and (b) MATH dataset. The total training duration is annotated next to the plot.
  • Figure 3: Ablation Study on the Rho-1B model. We compare the I-PPO framework with reweighting (hatched bars) against a variant without reweighting (solid bars) across five datasets.
  • Figure 4: Distribution of Unfaithful Reasoning Patterns. The chart illustrates total episode counts for Group C (correct final answers) and Group NC (incorrect final answers). The upper section displays episodes with positive influence scores, while the lower section shows episodes with negative influence scores.
  • Figure 5: Distribution of Influence Scores Across Different Training Stages for the Rho-1B model on the GSM8K Dataset. The histograms illustrate the progressive shift from predominantly positive, widely distributed scores in early training (steps 0-100) to a narrower distribution dominated by zero or negative scores as the model converges (steps 300-382).
  • ...and 3 more figures