Turning Sand to Gold: Recycling Data to Bridge On-Policy and Off-Policy Learning via Causal Bound

Tal Fiskus, Uri Shaham

TL;DR

This work introduces SUFT, a causal upper-bound loss optimization for DRL that leverages the Neyman-Rubin potential outcomes framework to bound the on-policy (factual) loss using off-policy (counterfactual) loss plus an estimated treatment effect term. The key novelty is recycling past value-network outputs stored in the experience replay buffer to compute a SUFT OPE term, enabling data reuse with minimal overhead. The authors prove a bound of the form $\epsilon_{F_{\phi}} \leq \epsilon_{CF_{\phi}} + \psi_{\phi} + \delta$, and provide a universal implementation strategy to inject the SUFT term into DRL losses for DQN, PPO, and related agents. Empirically, SUFT yields substantial gains across Atari and MuJoCo benchmarks (up to 383% mean reward improvements) while reducing buffer sizes by up to 96%, demonstrating improved sample efficiency with negligible computational cost. The work suggests broad applicability to on- and off-policy DRL and points to robotics as a promising domain for future real-world impact.

Abstract

Deep reinforcement learning (DRL) agents excel in solving complex decision-making tasks across various domains. However, they often require a substantial number of training steps and a vast experience replay buffer, leading to significant computational and resource demands. To address these challenges, we introduce a novel theoretical result that brings the Neyman-Rubin potential outcomes framework into DRL. Unlike most methods that focus on bounding the counterfactual loss, we establish a causal bound on the factual loss, which is analogous to the on-policy loss in DRL. This bound is computed by storing past value network outputs in the experience replay buffer, effectively utilizing data that is usually discarded. In extensive experiments across the Atari 2600 and MuJoCo domains, agents such as DQN and SAC achieve up to a 383% higher reward ratio than the same agents without our proposed term and reduce the experience replay buffer size by up to 96%, significantly improving sample efficiency at a negligible cost.
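The idea of storing past value network outputs alongside each transition can be pictured with a small replay buffer. This is only a minimal sketch under stated assumptions: the class name `ValueRecyclingBuffer`, the field `stored_values`, and the ring-buffer layout are hypothetical illustrations and are not taken from the paper's implementation.

```python
import numpy as np


class ValueRecyclingBuffer:
    """Ring buffer that keeps, next to each transition, the value-network
    output produced when that transition was collected (hypothetical layout)."""

    def __init__(self, capacity: int, obs_dim: int, seed: int = 0):
        self.capacity = capacity
        self.obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.next_obs = np.zeros((capacity, obs_dim), dtype=np.float32)
        self.actions = np.zeros(capacity, dtype=np.int64)
        self.rewards = np.zeros(capacity, dtype=np.float32)
        self.dones = np.zeros(capacity, dtype=np.float32)
        # The extra field: the value estimate that is usually discarded.
        self.stored_values = np.zeros(capacity, dtype=np.float32)
        self.pos = 0
        self.size = 0
        self.rng = np.random.default_rng(seed)

    def add(self, obs, action, reward, next_obs, done, value_out):
        """Store one transition plus the value output computed at collection time."""
        i = self.pos
        self.obs[i] = obs
        self.next_obs[i] = next_obs
        self.actions[i] = action
        self.rewards[i] = reward
        self.dones[i] = float(done)
        self.stored_values[i] = value_out
        self.pos = (self.pos + 1) % self.capacity  # overwrite the oldest entry when full
        self.size = min(self.size + 1, self.capacity)

    def sample(self, batch_size: int) -> dict:
        """Uniformly sample a batch, returning the stored value outputs as well."""
        idx = self.rng.integers(0, self.size, size=batch_size)
        return {
            "obs": self.obs[idx],
            "actions": self.actions[idx],
            "rewards": self.rewards[idx],
            "next_obs": self.next_obs[idx],
            "dones": self.dones[idx],
            "stored_values": self.stored_values[idx],
        }
```

The only change relative to an ordinary buffer in this sketch is one extra float per transition, which is consistent with the abstract's claim that the data reuse comes at negligible cost.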

Paper Structure

This paper contains 48 sections, 5 theorems, 63 equations, 17 figures, 13 tables, 2 algorithms.

Key Result

Theorem 4.10

The expected factual outcome loss, $\epsilon_{F_{\phi}}$, is bounded by the sum of the expected counterfactual outcome loss, $\epsilon_{CF_{\phi}}$, the estimated treatment effect loss, $\psi_{\phi}$, and a constant term $\delta$, independent of $\phi$:

$$\epsilon_{F_{\phi}} \leq \epsilon_{CF_{\phi}} + \psi_{\phi} + \delta$$
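Because $\delta$ does not depend on $\phi$, minimizing the right-hand side of the bound $\epsilon_{F_{\phi}} \leq \epsilon_{CF_{\phi}} + \psi_{\phi} + \delta$ over $\phi$ amounts to minimizing $\epsilon_{CF_{\phi}} + \psi_{\phi}$. A rough sketch of how such a surrogate could be added to a DQN-style loss follows. Everything specific here is an assumption made for illustration: the function name `suft_style_loss`, the use of a squared TD error as a stand-in for $\epsilon_{CF_{\phi}}$, the squared gap between current and stored value outputs as a stand-in for $\psi_{\phi}$, and the `weight` parameter. The paper's exact definitions of these terms are not reproduced in this summary.

```python
import numpy as np


def suft_style_loss(q_pred, td_target, q_stored, weight=1.0):
    """Surrogate objective: off-policy loss plus a recycled-value term.

    q_pred    : current value-network outputs for the sampled (s, a) pairs
    td_target : bootstrapped targets for the same pairs
    q_stored  : value outputs saved in the replay buffer at collection time
    weight    : hypothetical scaling factor for the extra term
    """
    q_pred = np.asarray(q_pred, dtype=np.float64)
    td_target = np.asarray(td_target, dtype=np.float64)
    q_stored = np.asarray(q_stored, dtype=np.float64)

    # Assumed stand-in for the counterfactual (off-policy) loss.
    td_loss = np.mean((q_pred - td_target) ** 2)
    # Assumed stand-in for the estimated treatment effect term.
    recycle_term = np.mean((q_pred - q_stored) ** 2)
    # delta is constant in phi, so it is omitted from the optimized objective.
    total = td_loss + weight * recycle_term
    return total, td_loss, recycle_term


total, td_loss, recycle_term = suft_style_loss(
    q_pred=[1.0, 2.0], td_target=[1.0, 4.0], q_stored=[0.0, 2.0]
)
print(total, td_loss, recycle_term)
```

In a real agent the same arithmetic would be expressed in an autodiff framework so that gradients flow through `q_pred` only, with `td_target` and `q_stored` treated as constants.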

Figures (17)

  • Figure 1: Log-scaled reward improvements comparison between agents using the additional SUFT OPE term and the baseline agents without it across 57 Atari games. The results demonstrate the superior performance of our method across the majority of games. The red line indicates a 10% improvement, and the green line represents a 100% improvement. Left: Double DQN SUFT outperforms the baseline agent in 35 out of the 40 valid games; Right: PPO SUFT outperforms the baseline agent in 39 out of the 42 valid games.
  • Figure 2: Mean reward ratio comparison between agents using the additional SUFT OPE term and the baseline agents without it across 57 Atari games and five MuJoCo environments, highlighting the profound reward gains across diverse agents and domains.
  • Figure 3: Learning curves comparison between agents using the SUFT OPE term and the baseline agents without it across selected Atari games. The red line indicates human-level performance, showing that SUFT not only surpasses the baseline agent but can even exceed human rewards. Top: Double DQN; Bottom: PPO.
  • Figure 4: Illustration diagram that demonstrates the reduction from the causal inference framework (Left) to the DRL framework (Right).
  • Figure 5: Magnitude of the $\delta$ term, SUFT OPE term, and the standard DQN loss across the Atari 57 benchmark.
  • ...and 12 more figures

Theorems & Definitions (39)

  • Definition 4.2
  • Definition 4.3
  • Definition 4.4
  • Definition 4.5
  • Definition 4.6
  • Definition 4.7
  • Definition 4.8
  • Definition 4.9
  • Theorem 4.10
  • Proposition B.2
  • ...and 29 more