A Snapshot of Influence: A Local Data Attribution Framework for Online Reinforcement Learning
Yuzheng Hu, Fan Wu, Haotian Ye, David Forsyth, James Zou, Nan Jiang, Jiaqi W. Ma, Han Zhao
TL;DR
This work introduces a local data attribution framework for online reinforcement learning, focusing on PPO. It treats records in the recent rollout buffer as the units of attribution, defines two target functions (agent action and cumulative return), and measures each record's influence through gradient similarity, adapting TracIn to the online setting. The authors demonstrate the framework through diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. They also develop Iterative Influence-Based Filtering (IIF), which filters harmful experiences and improves sample efficiency, training speed, and final returns, including in RLHF for large language models. The work discusses limitations, such as optimizer choices, applicability to other RL algorithms, and counterfactual interpretations, and outlines directions for future research.
Abstract
Online reinforcement learning (RL) excels in complex, safety-critical domains but suffers from sample inefficiency, training instability, and limited interpretability. Data attribution provides a principled way to trace model behavior back to training samples, yet existing methods assume fixed datasets, an assumption that is violated in online RL, where each experience both updates the policy and shapes future data collection. In this paper, we initiate the study of data attribution for online RL, focusing on the widely used Proximal Policy Optimization (PPO) algorithm. We start by establishing a local attribution framework, interpreting model checkpoints with respect to the records in the recent training buffer. We design two target functions, capturing agent action and cumulative return respectively, and measure each record's contribution through gradient similarity between its training loss and these targets. We demonstrate the power of this framework through three concrete applications: diagnosis of learning, temporal analysis of behavior formation, and targeted intervention during training. Leveraging this framework, we further propose an algorithm, iterative influence-based filtering (IIF), for online RL training that iteratively performs experience filtering to refine policy updates. From standard RL benchmarks (classic control, navigation, locomotion) to RLHF for large language models, IIF reduces sample complexity, speeds up training, and achieves higher returns. Together, these results open a new direction for making online RL more interpretable, efficient, and effective.
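To make the gradient-similarity idea concrete, here is a minimal sketch of scoring buffer records against an action target and then filtering on the scores. It is not the authors' implementation. It assumes a toy linear-softmax policy, a simplified per-record loss of the form -advantage * log pi(a|s) (no PPO clipping, value loss, or entropy term), a sign convention in which a positive score means a descent step on the record's loss raises the target to first order, and a filtering rule that drops negatively scored records. All function names are hypothetical.

```python
import numpy as np

def softmax(z):
    z = z - z.max()  # subtract the max for numerical stability
    e = np.exp(z)
    return e / e.sum()

def grad_log_prob(theta, state, action):
    """Gradient of log pi(action | state) w.r.t. theta.

    Assumed toy policy: logits = theta @ state, theta has shape (n_actions, dim).
    """
    probs = softmax(theta @ state)
    onehot = np.zeros_like(probs)
    onehot[action] = 1.0
    return np.outer(onehot - probs, state)

def record_loss_grad(theta, record):
    """Gradient of the simplified per-record loss -advantage * log pi(a | s)."""
    state, action, advantage = record
    return -advantage * grad_log_prob(theta, state, action)

def influence_scores(theta, buffer, target_grad):
    """Score each record by -<grad target, grad loss_i>.

    Assumed convention: positive means a gradient-descent step on this
    record's loss increases the target function to first order.
    """
    return np.array(
        [-np.sum(target_grad * record_loss_grad(theta, r)) for r in buffer]
    )

def filter_buffer(buffer, scores):
    """Keep records whose score is non-negative (an assumed filtering rule)."""
    return [r for r, s in zip(buffer, scores) if s >= 0]

# Toy example: 2 actions, 2-dimensional states, uniform initial policy.
theta = np.zeros((2, 2))
s = np.array([1.0, 0.0])
buffer = [(s, 0, 1.0), (s, 0, -1.0), (s, 1, 1.0)]  # (state, action, advantage)

# Action target: log-probability of taking action 0 in state s.
target_grad = grad_log_prob(theta, s, 0)
scores = influence_scores(theta, buffer, target_grad)
kept = filter_buffer(buffer, scores)
print(scores)     # [ 0.5 -0.5 -0.5]
print(len(kept))  # 1
```

In this toy buffer, only the first record (action 0 with positive advantage) pushes the policy toward the target action, so it is the only one kept. A return-style target would reuse the same scoring code with a different `target_grad`.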
