A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning

Yuzheng Hu, Fan Wu, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao

TL;DR

This work introduces a local data attribution framework for online reinforcement learning, focusing on PPO. By treating recent rollout-buffer records as attribution units and defining two target functions—agent actions and cumulative return—it measures each record's influence via gradient-based similarity, adapting TracIn to the online setting. The authors demonstrate the utility of the framework through diagnosis, behavioral analysis, and targeted interventions, and they develop Iterative Influence-Based Filtering (IIF) to filter harmful experiences, yielding substantial gains in sample efficiency, training speed, and final returns, including strong improvements in RLHF scenarios. While the approach shows promise across diverse RL tasks, the work discusses limitations such as optimizer choices and the need for broader applicability to other RL algorithms and counterfactual interpretations, outlining clear directions for future research.

Abstract

Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability. Data attribution provides a principled way to trace model behavior back to training samples, yet existing methods assume a fixed dataset, an assumption that is violated in online RL, where each experience both updates the policy and shapes future data collection. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a local attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record's contribution through the gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. From standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Together, these results open a new direction for making online RL more interpretable, efficient, and effective.
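
The gradient-similarity idea in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation. It assumes a linear softmax policy, replaces the clipped PPO objective with the simplified surrogate $\hat{A}\,\log \pi(a|s)$, uses the log-probability of one chosen action as the action target, and drops records with a negative score as one plausible filtering rule. All function names and the toy data are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def grad_log_prob(theta, state, action):
    """Flattened gradient of log pi(action | state) for a linear softmax policy.

    theta[b] is the weight vector for action b (an assumed toy parameterization).
    """
    logits = [sum(w * x for w, x in zip(row, state)) for row in theta]
    probs = softmax(logits)
    grad = []
    for b, p in enumerate(probs):
        coeff = (1.0 if b == action else 0.0) - p
        grad.extend(coeff * x for x in state)
    return grad


def dot(u, v):
    return sum(a * b for a, b in zip(u, v))


def influence_scores(theta, records, target_state, target_action):
    """First-order influence of each record on the action target.

    Score = <grad of target, ascent direction of the record's surrogate>.
    A positive score means a small update on that record raises the target.
    """
    g_target = grad_log_prob(theta, target_state, target_action)
    scores = []
    for state, action, advantage in records:
        # Simplified surrogate A_hat * log pi(a|s); assumed stand-in for PPO's loss.
        g_record = [advantage * g for g in grad_log_prob(theta, state, action)]
        scores.append(dot(g_target, g_record))
    return scores


def filter_records(records, scores):
    """Keep records whose score is non-negative (assumed filtering rule)."""
    return [r for r, s in zip(records, scores) if s >= 0.0]


# Toy buffer of (state, action, advantage estimate) records.
theta = [[0.0, 0.0], [0.0, 0.0]]  # 2 actions, 2 state features
records = [
    ([1.0, 0.0], 0, 1.0),   # pushes toward the target action
    ([1.0, 0.0], 0, -1.0),  # pushes away from the target action
    ([0.0, 1.0], 1, 0.5),   # different state, orthogonal gradient
]
scores = influence_scores(theta, records, [1.0, 0.0], 0)
kept = filter_records(records, scores)
```

In this toy buffer only the second record receives a negative score and is filtered out. A return-based target would swap `g_target` for the gradient of an estimate of cumulative return, and an iterative scheme would repeat the scoring and filtering in every round before the policy update.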

Paper Structure

This paper contains 49 sections, 11 equations, 16 figures, 6 tables, 1 algorithm.

Figures (16)

  • Figure 1: An illustration of the alternating learning cycle in online RL and our local data attribution framework. Online RL operates in alternating rounds of data collection and policy updates; our local data attribution framework quantifies how individual records from a single round influence different aspects of the policy update in that round.
  • Figure 2: Twofold data influence: driving policy updates, shaping future data collection.
  • Figure 3: (a-b) Examples of bottom records. (a) Bottom 100 records in FrozenLake at $k=5$, aggregated over $(s,a)$ for demonstration: arrow indicates action, green/red for positive/negative $\hat{A}$. (b) Selected records among bottom 20 in MiniGrid at $k=5$: $\blacktriangledown$--agent, $\blacksquare$--goal, gray area--the limited egocentric observation, yellow arrows--agent action in $\{\text{turn left}, \text{turn right}, \text{forward}\}$; all records shown are of positive $\hat{A}$. (c-d) These records are harmful due to their inaccurate advantage estimates. We sort records by decreasing influence (top on the left). (c) $y$ axis is $|\bar{A}-\hat{A}|$; points with same/opposite signs for $\hat{A}$ and $\bar{A}$ colored green/red; top/bottom 20% region shaded green/red, and the intermediate in gray. (d) The product $\bar{A}\cdot\hat{A}$ versus record rank, showing a strong negative correlation.
  • Figure 4: Phase change of top records in Highway, where the target behavior is taking the action "slower" when tailing the front vehicle. In the inner plot, the black curve depicts $\pi(a|s)$; the red curve shows the measured roughness of the graph. Distinct icons mark the ego vehicle and the other vehicles. Three phases, in order: simple action-advantage associations; semantic clustering (tailing states); no clear patterns.
  • Figure 5: Boxplots of $\Delta$ return for single-round interventions in two environments; the red dashed line marks zero $\Delta$. We intervene in each round independently. The $\Delta$ return is the test return of the model trained on the filtered dataset minus that of the model trained on the original dataset. Results are shown for $3$ random seeds. Additional results can be found in the appendix.
  • ...and 11 more figures
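
The quantities plotted in Figure 3 (c-d) can be computed per record as in the sketch below. It assumes that $\hat{A}$ is the advantage estimate used in training and that $\bar{A}$ is a separately obtained reference advantage; the captions above do not define $\bar{A}$, so this reading, the function name, and the sample values are all hypothetical.

```python
def advantage_diagnostics(a_hat, a_ref):
    """Per-record diagnostics comparing an estimate A_hat with a reference A_bar.

    abs_error     -> |A_bar - A_hat|, the y axis described for panel (c)
    product       -> A_bar * A_hat, the quantity described for panel (d)
    sign_mismatch -> True when the two advantages have opposite signs
    """
    rows = []
    for est, ref in zip(a_hat, a_ref):
        rows.append({
            "abs_error": abs(ref - est),
            "product": ref * est,
            "sign_mismatch": ref * est < 0,
        })
    return rows


# Made-up values for three records, already sorted by decreasing influence.
a_hat = [1.0, 0.5, -0.2]
a_ref = [0.8, -0.5, -0.1]
diag = advantage_diagnostics(a_hat, a_ref)
mismatches = [row["sign_mismatch"] for row in diag]
```

Here only the second record has opposite signs for the two advantages, which corresponds to the points the caption colors red in panel (c).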

Theorems & Definitions (2)

  • Remark 1: Use cases of the two target functions
  • Remark 2: Extension to other online RL algorithms