SelfBC: Self Behavior Cloning for Offline Reinforcement Learning

Shirong Liu, Chenjia Bai, Zixian Guo, Hao Zhang, Gaurav Sharma, Yang Liu

TL;DR

SelfBC introduces a dynamic constraint for offline RL by using an EMA-updated reference policy, allowing the learned policy to progressively diverge from the dataset while remaining close to a continually improving benchmark. By embedding SelfBC into TD3, the method TD3+SelfBC (and TD3+ESBC with ensembles) achieves non-conservative, stable policy improvement and state-of-the-art performance among policy-constrained offline RL methods, especially on non-expert MuJoCo datasets. The approach is grounded in a CPI-like theoretical analysis that links conservative reference updates to monotonic improvement, and it is validated through comprehensive experiments and ablations. This framework mitigates the conservatism problem, offering a practical and scalable path for offline RL in real-world settings where data quality varies and online interaction is costly or unsafe.

Abstract

Policy constraint methods in offline reinforcement learning employ additional regularization techniques to constrain the discrepancy between the learned policy and the offline dataset. However, these methods tend to result in overly conservative policies that resemble the behavior policy, thus limiting their performance. We investigate this limitation and attribute it to the static nature of traditional constraints. In this paper, we propose a novel dynamic policy constraint that restricts the learned policy on the samples generated by the exponential moving average of previously learned policies. By integrating this self-constraint mechanism into off-policy methods, our method facilitates the learning of non-conservative policies while avoiding policy collapse in the offline setting. Theoretical results show that our approach results in a nearly monotonically improved reference policy. Extensive experiments on the D4RL MuJoCo domain demonstrate that our proposed method achieves state-of-the-art performance among the policy constraint methods.
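The self-constraint described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the linear policy, the mixing rate `tau`, and the helper names `ema_update` and `self_bc_penalty` are assumptions introduced here for illustration only. The sketch shows a reference policy whose parameters track an exponential moving average of the learned policy, and a behavior-cloning-style penalty measured against the reference policy's actions rather than the dataset actions.

```python
from typing import List

def ema_update(ref_params: List[float], policy_params: List[float], tau: float) -> List[float]:
    """Move the reference parameters a fraction tau toward the current policy (hypothetical helper)."""
    return [(1.0 - tau) * r + tau * p for r, p in zip(ref_params, policy_params)]

def linear_policy(params: List[float], state: List[float]) -> float:
    """Toy deterministic policy: the action is the dot product of params and state."""
    return sum(w * x for w, x in zip(params, state))

def self_bc_penalty(policy_params: List[float], ref_params: List[float],
                    states: List[List[float]]) -> float:
    """Mean squared gap between policy actions and reference-policy actions on the given states."""
    gaps = [(linear_policy(policy_params, s) - linear_policy(ref_params, s)) ** 2
            for s in states]
    return sum(gaps) / len(gaps)

# Toy data: three dataset states, a reference policy, and a learned policy.
states = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
ref = [0.0, 0.0]
policy = [1.0, -1.0]

penalty_before = self_bc_penalty(policy, ref, states)

# For illustration the learned policy is held fixed while the reference tracks it;
# in actual training both would be updated in alternation.
for _ in range(100):
    ref = ema_update(ref, policy, tau=0.05)

penalty_after = self_bc_penalty(policy, ref, states)
print(penalty_before, penalty_after)
```

Because the reference moves slowly toward the learned policy, the penalty shrinks over time instead of pinning the policy to the dataset, which is the intuition behind calling the constraint dynamic. A small `tau` corresponds to a more conservative reference update.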

Paper Structure

This paper contains 37 sections, 4 theorems, 58 equations, 9 figures, 5 tables, 4 algorithms.

Key Result

Lemma 1

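(Kakade & Langford, 2002; Zhuang et al., 2023) The difference in expected discounted return between two arbitrary policies $\pi',\pi$ is $$J(\pi') - J(\pi) = \sum_{s}\rho_{\pi'}\left(s\right)\sum_{a}\pi'\left(a|s\right)A_{\pi}\left(s,a\right),$$ where $J$ denotes the expected discounted return, $A_{\pi}$ is the advantage function of $\pi$, and $\rho_\pi\left(s\right) = \sum_{t=0}^{\infty}\gamma^t P\left(s_t=s|\pi\right)$ represents the unnormalized discounted state visitation frequencies.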
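The standard performance-difference identity from Kakade & Langford, $J(\pi') - J(\pi) = \sum_s \rho_{\pi'}(s) \sum_a \pi'(a|s) A_\pi(s,a)$, can be checked numerically on a small tabular problem. The sketch below is only an illustration: the two-state, two-action MDP, its transition table `P`, rewards `R`, initial distribution `MU`, and both policies are made-up assumptions, not anything taken from the paper.

```python
GAMMA = 0.9
# Hypothetical MDP: P[s][a] is the next-state distribution, R[s][a] the reward.
P = [[[0.8, 0.2], [0.1, 0.9]],
     [[0.5, 0.5], [0.3, 0.7]]]
R = [[1.0, 0.0], [0.0, 2.0]]
MU = [0.6, 0.4]  # initial state distribution

def evaluate(pi, iters=2000):
    """Iterative policy evaluation: returns V(s) for a stochastic policy pi[s][a]."""
    v = [0.0, 0.0]
    for _ in range(iters):
        v = [sum(pi[s][a] * (R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(2)))
                 for a in range(2))
             for s in range(2)]
    return v

def advantage(pi):
    """A(s, a) = Q(s, a) - V(s) under policy pi."""
    v = evaluate(pi)
    return [[R[s][a] + GAMMA * sum(P[s][a][t] * v[t] for t in range(2)) - v[s]
             for a in range(2)]
            for s in range(2)]

def visitation(pi, iters=2000):
    """Unnormalized discounted visitation: rho(s) = sum_t gamma^t P(s_t = s | pi)."""
    d = MU[:]            # state distribution at the current time step
    rho = [0.0, 0.0]
    discount = 1.0
    for _ in range(iters):
        rho = [rho[s] + discount * d[s] for s in range(2)]
        d = [sum(d[s] * pi[s][a] * P[s][a][t] for s in range(2) for a in range(2))
             for t in range(2)]
        discount *= GAMMA
    return rho

def expected_return(pi):
    """J(pi): expected discounted return from the initial distribution."""
    v = evaluate(pi)
    return sum(MU[s] * v[s] for s in range(2))

pi = [[0.5, 0.5], [0.5, 0.5]]        # baseline policy
pi_new = [[0.9, 0.1], [0.2, 0.8]]    # candidate policy

lhs = expected_return(pi_new) - expected_return(pi)
adv = advantage(pi)
rho = visitation(pi_new)
rhs = sum(rho[s] * pi_new[s][a] * adv[s][a] for s in range(2) for a in range(2))
print(lhs, rhs)
```

The two sides agree up to numerical tolerance. Note that the visitation frequencies are taken under the new policy $\pi'$ while the advantages are taken under the old policy $\pi$, and that $\rho$ sums to $1/(1-\gamma)$ rather than to one because it is unnormalized.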

Figures (9)

  • Figure 1: For different values of $\beta$, TD3+BC was trained on each of the D4RL (Fu et al., 2020) datasets and, for the final learned policy, values were recorded for the normalized score and the dataset BC MSE, i.e., the average BC penalty in Eq. \ref{eq:TD3+BC}. For two datasets, we illustrate in (a) the dependence of the normalized score and dataset BC MSE on $\beta$, and in (b) we plot the normalized score against the dataset BC log MSE obtained for different $\beta$ in blue (for TD3+BC) and compare against the corresponding results obtained with our proposed TD3+SelfBC under an identical training setting, which are plotted in green. (The results are averaged across multiple seeds.)
  • Figure 2: The effect of $scale_{\textnormal{ref}}$ on performance. We show the normalized scores and the dataset BC MSEs during the training process of TD3+SelfBC.
  • Figure 3: Performance difference of TD3+EBC and three variants of TD3+SelfBC compared with TD3+SelfBC.
  • Figure 4: Performance difference of TD3+SelfBC variants with alternative pretrain algorithms compared with the original TD3+SelfBC that uses TD3+EBC.
  • Figure 5: Results, on 12 D4RL (Fu et al., 2020) MuJoCo datasets, for TD3+BC trained with different values of the parameter $\beta$ that controls the stringency of the policy constraint. See the caption of Fig. \ref{fig:analys2}(a) for additional details.
  • ...and 4 more figures

Theorems & Definitions (9)

  • proof
  • Definition 1
  • Lemma 1
  • Definition 2
  • Theorem 1
  • proof
  • Definition 3
  • Lemma 2
  • Lemma 3