On the Reuse Bias in Off-Policy Reinforcement Learning

Chengyang Ying; Zhongkai Hao; Xinning Zhou; Hang Su; Dong Yan; Jun Zhu

TL;DR

This work identifies Reuse Bias, the bias introduced in off-policy RL when the same replay-buffer data is reused both to evaluate and to optimize the current policy. It proves that IS-based evaluation can overestimate the true policy return and derives a high-probability bound that depends on a KL-divergence term and a data-discrepancy term, linking bias control to algorithmic stability. To mitigate the bias, the authors propose Bias-Regularized Importance Sampling (BIRIS), which regularizes policy optimization with a bias term based on the state-action likelihood ratio, and they show how to implement it with SAC and TD3. Empirically, BIRIS reduces Reuse Bias in MiniGrid and improves sample efficiency on MuJoCo tasks. It is a plug-in addition that complements existing variance-reduction techniques and extends to actor-critic frameworks. Overall, the paper contributes a formal bias analysis, a mitigation strategy derived from that analysis, and empirical validation of more stable and efficient off-policy learning.

Abstract

Importance sampling (IS) is a popular technique in off-policy evaluation, which re-weights the return of trajectories in the replay buffer to boost sample efficiency. However, training with IS can be unstable, and previous attempts to address this issue mainly focus on analyzing the variance of IS. In this paper, we reveal that the instability is also related to a new notion of Reuse Bias of IS -- the bias in off-policy evaluation caused by the reuse of the replay buffer for evaluation and optimization. We theoretically show that the off-policy evaluation and optimization of the current policy with the data from the replay buffer result in an overestimation of the objective, which may cause an erroneous gradient update and degrade performance. We further provide a high-probability upper bound of the Reuse Bias and, by introducing the concept of stability for off-policy algorithms, show that controlling one term of the upper bound can control the Reuse Bias. Based on these analyses, we finally present a novel Bias-Regularized Importance Sampling (BIRIS) framework along with practical algorithms, which can alleviate the negative impact of the Reuse Bias. Experimental results show that our BIRIS-based methods can significantly improve the sample efficiency on a series of continuous control tasks in MuJoCo.
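To make the re-weighting concrete, here is a minimal sketch of ordinary IS and weighted IS (WIS) estimation. It is not taken from the paper: it assumes a one-step (bandit) setting in which each "trajectory" is a single (action, reward) pair, and the names `is_estimate`, `wis_estimate`, and `collect` are hypothetical helpers introduced only for illustration.

```python
import random


def importance_weights(buffer, target, behavior):
    """Likelihood ratio target(a) / behavior(a) for every stored action."""
    return [target[a] / behavior[a] for a, _ in buffer]


def is_estimate(buffer, target, behavior):
    """Ordinary IS estimate of the target policy's expected reward.

    buffer: list of (action, reward) pairs collected under `behavior`.
    target, behavior: lists of action probabilities (assumed toy policies).
    """
    weights = importance_weights(buffer, target, behavior)
    return sum(w * r for w, (_, r) in zip(weights, buffer)) / len(buffer)


def wis_estimate(buffer, target, behavior):
    """Weighted (self-normalized) IS: divide by the sum of the weights."""
    weights = importance_weights(buffer, target, behavior)
    total = sum(weights)
    if total == 0.0:
        return 0.0  # no stored action has support under the target policy
    return sum(w * r for w, (_, r) in zip(weights, buffer)) / total


def collect(behavior, mean_reward, n, rng):
    """Fill a toy replay buffer: actions from `behavior`, Gaussian rewards."""
    actions = rng.choices(range(len(behavior)), weights=behavior, k=n)
    return [(a, rng.gauss(mean_reward[a], 1.0)) for a in actions]


if __name__ == "__main__":
    rng = random.Random(0)
    behavior, target = [0.5, 0.5], [0.25, 0.75]
    buffer = collect(behavior, mean_reward=[1.0, 3.0], n=1000, rng=rng)
    print("IS :", is_estimate(buffer, target, behavior))
    print("WIS:", wis_estimate(buffer, target, behavior))
```

Both estimators are evaluated here for a target policy fixed in advance. The Reuse Bias discussed in the abstract arises when the target policy is itself chosen using the same buffer.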

Paper Structure (32 sections, 8 theorems, 71 equations, 9 figures, 2 tables)

Introduction
Related Work
Off-policy Evaluation.
Bias in Off-policy and Offline RL.
Preliminary
Reuse Bias
Reuse Bias
High Probability Bound for Reuse Error
Methodology
Theoretical Analysis on Controlling Reuse Bias
Bias-Regularized Importance Sampling Framework
Connection with Actor-Critic Methods
Experiments
Experiment Setup
Gridworld.
...and 17 more sections

Key Result

Theorem 1

Assume that $\mathcal{O}^*(\pi_0, \mathcal{B})$ is the optimal policy of $\mathcal{H}$ over the replay buffer $\mathcal{B}$. We can show that $\hat{J}_{\hat{\Pi}, \mathcal{B}}(\mathcal{O}^*(\pi_0, \mathcal{B}))$ is an overestimation of $J(\mathcal{O}^*(\pi_0, \mathcal{B}))$, i.e., $\epsilon_{\mathrm{RB}}(\mathcal{O}^*, \pi_0) = \mathbb{E}_{\mathcal{B}\sim\hat{\Pi}}\left[\epsilon_{\mathrm{RE}} \cdots\right]$ ...
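The overestimation can be illustrated numerically. The sketch below is not the paper's experiment or proof. It assumes a toy bandit in which every arm has true mean reward 0, so every policy has true return 0, and a hypothesis set of deterministic policies ("always pull arm k"). All function names are hypothetical. The policy that maximizes the IS estimate on a buffer is evaluated once on that same buffer and once on an independent buffer.

```python
import random


def sample_buffer(num_arms, buffer_size, rng):
    """Draw (action, reward) pairs under a uniform behavior policy.

    Toy assumption: every arm has true mean reward 0, so the true
    return J(pi) is 0 for every policy pi.
    """
    return [(rng.randrange(num_arms), rng.gauss(0.0, 1.0))
            for _ in range(buffer_size)]


def is_estimates(buffer, num_arms):
    """IS estimate of the return of each policy 'always pull arm k'."""
    behavior_prob = 1.0 / num_arms
    estimates = [0.0] * num_arms
    for action, reward in buffer:
        # The importance weight is 1 / behavior_prob for the policy that
        # matches the stored action and 0 for every other policy.
        estimates[action] += reward / (behavior_prob * len(buffer))
    return estimates


def mean_errors(num_arms=5, buffer_size=30, trials=2000, seed=0):
    """Average (estimate - true return) of the buffer-optimal policy."""
    rng = random.Random(seed)
    reused, fresh = 0.0, 0.0
    for _ in range(trials):
        train = is_estimates(sample_buffer(num_arms, buffer_size, rng), num_arms)
        best_arm = max(range(num_arms), key=lambda k: train[k])
        held_out = is_estimates(sample_buffer(num_arms, buffer_size, rng), num_arms)
        reused += train[best_arm]    # evaluated on the buffer used for selection
        fresh += held_out[best_arm]  # evaluated on an independent buffer
    # The true return is 0, so the averaged estimates are the averaged errors.
    return reused / trials, fresh / trials


if __name__ == "__main__":
    reused, fresh = mean_errors()
    print(f"mean error with reused buffer: {reused:.3f}")
    print(f"mean error with fresh buffer:  {fresh:.3f}")
```

Because the selected policy is the maximizer of a noisy estimate, its estimate on the reused buffer is positive on average even though its true return is 0, while the estimate on an independent buffer averages close to 0.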

Figures (9)

Figure 1: (a) A high-level illustration of the Reuse Bias and our BIRIS. The Reuse Bias arises because off-policy methods optimize and evaluate the policy with the same data in the replay buffer. (b) Experimental results of PG+IS, PG+WIS, PG+BIRIS+IS, and PG+BIRIS+WIS in MiniGrid 8$\times$8 with replay buffer size 30. In each subfigure, $J$ and $\hat{J}$ represent the expected return of the target policy and its return estimated via the replay buffer, respectively. We repeat the experiment 50 times and show box plots, where the orange dashed line represents the mean. This figure shows that the Reuse Bias is severe enough to cause an erroneous policy evaluation and that our BIRIS can significantly reduce it. (More details are in the experiments section and the MiniGrid appendix.)
Figure 2: Cumulative reward curves for SAC, SAC+PER, SAC+BIRIS, TD3, TD3+PER, and TD3+BIRIS. The x-axes indicate the number of steps interacting with the environment, and the y-axes indicate the performance of the agent, including average rewards with std.
Figure 3: Results of the ablation study. Left: the influence of clipping. Right: the influence of the hyperparameter $\alpha$.
...and 6 more figures

Theorems & Definitions (18)

Definition 1: Reuse Bias
Theorem 1: Overestimation for Off-Policy Evaluation (proof in appendix)
Theorem 2: Overestimation for One-Step PG (proof in appendix)
Theorem 3 (proof in appendix)
Theorem 4: High-Probability Bound for Reuse Error (proof in appendix)
Definition 2: Stability for Off-Policy Algorithm
Theorem 5: Bound for the Reuse Error of Stable Algorithm (proof in appendix)
Theorem 6 (details and proof in appendix)
...and 10 more
