On the Reuse Bias in Off-Policy Reinforcement Learning
Chengyang Ying, Zhongkai Hao, Xinning Zhou, Hang Su, Dong Yan, Jun Zhu
TL;DR
This work identifies Reuse Bias as a bias introduced when replay-buffer data is reused for both evaluating and optimizing the current policy in off-policy RL. It proves that IS-based evaluation can overestimate the true policy return and introduces a high-probability bound that depends on a KL-divergence term and a data-discrepancy term, linking bias control to algorithmic stability. To mitigate this bias, the authors propose Bias-Regularized Importance Sampling (BIRIS), which regularizes the policy optimization with a bias term based on the state-action likelihood ratio, and they show how to implement it with SAC and TD3. Empirically, BIRIS reduces Reuse Bias and improves sample efficiency on MiniGrid and MuJoCo tasks, offering a practical, plug-in enhancement that complements existing variance-reduction techniques and can be extended to actor-critic frameworks. Overall, the paper provides a formal bias analysis, a principled mitigation strategy, and empirical validation on both benchmarks for more stable and efficient off-policy learning.
Abstract
Importance sampling (IS) is a popular technique in off-policy evaluation, which re-weights the return of trajectories in the replay buffer to boost sample efficiency. However, training with IS can be unstable, and previous attempts to address this issue mainly focus on analyzing the variance of IS. In this paper, we reveal that the instability is also related to a new notion of Reuse Bias of IS: the bias in off-policy evaluation caused by the reuse of the replay buffer for evaluation and optimization. We theoretically show that the off-policy evaluation and optimization of the current policy with the data from the replay buffer result in an overestimation of the objective, which may cause an erroneous gradient update and degrade performance. We further provide a high-probability upper bound of the Reuse Bias, and show that controlling one term of the upper bound can control the Reuse Bias by introducing the concept of stability for off-policy algorithms. Based on these analyses, we finally present a novel Bias-Regularized Importance Sampling (BIRIS) framework along with practical algorithms, which can alleviate the negative impact of the Reuse Bias. Experimental results show that our BIRIS-based methods can significantly improve the sample efficiency on a series of continuous control tasks in MuJoCo.
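
The overestimation described above can be illustrated with a small simulation. This is a minimal sketch under assumptions that are not taken from the paper: a 5-armed bandit in which every arm has true mean reward 0, a uniform behavior policy, and a "policy optimization" step that simply picks the arm with the highest IS estimate. All names (is_estimate, run_trial, simulate) and parameter values are hypothetical. Because every policy has true value 0, any systematic gap between the estimate on the reused buffer and 0 is caused by selecting and evaluating on the same data.

```python
import numpy as np


def is_estimate(actions, rewards, target_arm, n_arms):
    """IS estimate of the policy that always pulls `target_arm`,
    from data collected by a uniform behavior policy."""
    # Importance weight pi(a) / mu(a), with mu(a) = 1 / n_arms.
    weights = (actions == target_arm) * n_arms
    return float(np.mean(weights * rewards))


def run_trial(rng, n_arms=5, n_samples=20):
    def collect():
        actions = rng.integers(n_arms, size=n_samples)
        # Assumption for illustration: every arm has true mean reward 0.
        rewards = rng.normal(0.0, 1.0, size=n_samples)
        return actions, rewards

    # "Replay buffer" used for both optimization and evaluation.
    buf_actions, buf_rewards = collect()
    estimates = [
        is_estimate(buf_actions, buf_rewards, k, n_arms) for k in range(n_arms)
    ]
    best_arm = int(np.argmax(estimates))  # policy optimized on the buffer
    reused = estimates[best_arm]          # evaluated on the same buffer

    # Evaluate the same policy on independent data for comparison.
    fresh_actions, fresh_rewards = collect()
    fresh = is_estimate(fresh_actions, fresh_rewards, best_arm, n_arms)
    return reused, fresh


def simulate(n_trials=2000, seed=0):
    rng = np.random.default_rng(seed)
    results = np.array([run_trial(rng) for _ in range(n_trials)])
    return results[:, 0].mean(), results[:, 1].mean()


reused_mean, fresh_mean = simulate()
print(f"mean IS estimate on reused buffer: {reused_mean:.3f}")
print(f"mean IS estimate on fresh data:    {fresh_mean:.3f}")
```

In this toy setting the estimate computed on the reused buffer averages well above the true value of 0, while the estimate on fresh data stays close to 0. The sketch only demonstrates the selection effect behind the bias; it does not implement BIRIS or the paper's bound.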
