SelfBC: Self Behavior Cloning for Offline Reinforcement Learning
Shirong Liu, Chenjia Bai, Zixian Guo, Hao Zhang, Gaurav Sharma, Yang Liu
TL;DR
SelfBC introduces a dynamic constraint for offline RL: the learned policy is regularized toward a reference policy maintained as an exponential moving average (EMA) of previously learned policies, rather than toward the fixed dataset. This lets the learned policy diverge progressively from the dataset while staying close to a continually improving reference. Embedded in TD3, the resulting methods, TD3+SelfBC and its ensemble variant TD3+ESBC, learn non-conservative policies with stable improvement and achieve state-of-the-art performance among policy constraint offline RL methods, especially on non-expert MuJoCo datasets. A CPI-like theoretical analysis links conservative reference updates to nearly monotonic improvement of the reference policy, and experiments and ablations support the approach. By mitigating over-conservatism, the framework suits offline settings where data quality varies and online interaction is costly or unsafe.
Abstract
Policy constraint methods in offline reinforcement learning employ additional regularization techniques to constrain the discrepancy between the learned policy and the offline dataset. However, these methods tend to result in overly conservative policies that resemble the behavior policy, thus limiting their performance. We investigate this limitation and attribute it to the static nature of traditional constraints. In this paper, we propose a novel dynamic policy constraint that restricts the learned policy on the samples generated by the exponential moving average of previously learned policies. By integrating this self-constraint mechanism into off-policy methods, our method facilitates the learning of non-conservative policies while avoiding policy collapse in the offline setting. Theoretical results show that our approach results in a nearly monotonically improved reference policy. Extensive experiments on the D4RL MuJoCo domain demonstrate that our proposed method achieves state-of-the-art performance among the policy constraint methods.
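The self-constraint mechanism described above can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the flat parameter lists, the rate `tau`, the weight `lam`, and the TD3+BC-style form of the actor loss (a Q-value term plus a squared distance to the reference action) are all assumptions made for illustration.

```python
from typing import List


def ema_update(ref_params: List[float], policy_params: List[float], tau: float) -> List[float]:
    """Move the reference parameters a fraction tau toward the current policy.

    A small tau gives a slowly changing (conservative) reference policy.
    """
    return [(1.0 - tau) * r + tau * p for r, p in zip(ref_params, policy_params)]


def self_bc_penalty(policy_action: List[float], ref_action: List[float]) -> float:
    """Mean squared distance between the policy's action and the reference action."""
    return sum((a - b) ** 2 for a, b in zip(policy_action, ref_action)) / len(policy_action)


def actor_loss(q_value: float, policy_action: List[float],
               ref_action: List[float], lam: float) -> float:
    """Assumed TD3+BC-style objective: maximize Q while staying near the reference.

    The reference action stands in for the dataset action used by a static
    behavior cloning constraint.
    """
    return -lam * q_value + self_bc_penalty(policy_action, ref_action)


# Toy usage: the reference drifts toward the policy over repeated updates.
ref = [0.0, 0.0]
policy = [1.0, -1.0]
for _ in range(3):
    ref = ema_update(ref, policy, tau=0.5)
print(ref)  # [0.875, -0.875]
```

Because the reference is itself an average of earlier policies, the penalty target moves as learning proceeds, which is the contrast with a static constraint toward dataset actions that the abstract draws.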
