Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound
Tal Fiskus, Uri Shaham
TL;DR
This work introduces SUFT, a causal upper-bound loss optimization method for deep reinforcement learning (DRL). It uses the Neyman-Rubin potential outcomes framework to bound the on-policy (factual) loss by the off-policy (counterfactual) loss plus an estimated treatment effect term. The key novelty is recycling past value-network outputs stored in the experience replay buffer to compute a SUFT OPE term, enabling data reuse with minimal overhead. The authors prove a bound of the form $\epsilon_{F_{\phi}} \leq \epsilon_{CF_{\phi}} + \psi_{\phi} + \delta$ and provide a universal implementation strategy for injecting the SUFT term into the losses of DQN, PPO, and related agents. Empirically, SUFT yields substantial gains across Atari and MuJoCo benchmarks (up to 383% mean reward improvement) while reducing buffer sizes by up to 96%, demonstrating improved sample efficiency at negligible computational cost. The work suggests broad applicability to on- and off-policy DRL and points to robotics as a promising domain for future real-world impact.
Abstract
Deep reinforcement learning (DRL) agents excel at solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that brings the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods, which focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. In extensive experiments across the Atari 2600 and MuJoCo domains, agents such as DQN and SAC trained with our proposed term achieve up to a 383% higher reward ratio than the same agents without it, while reducing the experience replay buffer size by up to 96%, significantly improving sample efficiency at negligible cost.
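As a rough illustration of the recycling idea, the Python sketch below keeps the value-network output produced at collection time next to each transition, then adds a term computed from those stored outputs to an off-policy loss. Everything named here is hypothetical: `ValueRecyclingBuffer`, `placeholder_gap_term`, `upper_bound_loss`, the scalar `value_fn`, and the `weight` parameter are assumptions made for illustration. In particular, the mean absolute gap used for the extra term is only a stand-in, since the text above does not specify how the SUFT term is computed. This is not the authors' implementation.

```python
import random
from collections import deque
from dataclasses import dataclass


@dataclass
class Transition:
    state: tuple
    action: int
    reward: float
    next_state: tuple
    done: bool
    stored_value: float  # value-network output recorded when the step was collected


class ValueRecyclingBuffer:
    """Replay buffer that keeps the past value output next to each transition."""

    def __init__(self, capacity: int):
        self.data = deque(maxlen=capacity)  # oldest entries are evicted first

    def add(self, transition: Transition) -> None:
        self.data.append(transition)

    def sample(self, batch_size: int, rng: random.Random) -> list:
        return rng.sample(list(self.data), min(batch_size, len(self.data)))


def placeholder_gap_term(batch, value_fn) -> float:
    """Assumed stand-in for the extra term: mean |current value - stored value|."""
    return sum(abs(value_fn(t.state) - t.stored_value) for t in batch) / len(batch)


def upper_bound_loss(batch, value_fn, gamma: float = 0.99, weight: float = 1.0) -> float:
    """Mean squared TD error on replayed data plus the weighted placeholder term."""
    td_total = 0.0
    for t in batch:
        bootstrap = 0.0 if t.done else gamma * value_fn(t.next_state)
        td_total += (t.reward + bootstrap - value_fn(t.state)) ** 2
    return td_total / len(batch) + weight * placeholder_gap_term(batch, value_fn)
```

In this sketch the extra term is zero while the value function is unchanged since collection and grows as the current outputs drift away from the stored ones, which is one simple way to picture how stored outputs could tie an off-policy loss back to the data-collecting policy.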
