Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning

Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye

TL;DR

This work identifies a fundamental suboptimality ceiling in BC-regularized offline RL when the dataset actions are not optimal. It formalizes convergence limitations and demonstrates the issue on a controlled bandit task, then introduces Proximal Action Replacement (PAR), a plug-and-play mechanism that progressively substitutes low-value dataset actions with high-value actions generated by a stable actor, governed by a critic-based reliability gate to preserve stability. PAR is compatible with multiple BC regularizations and yields consistent performance gains across diverse offline RL benchmarks, often approaching state-of-the-art while adding minimal computational overhead. Collectively, PAR offers a practical pathway to surpass imitation ceilings in offline RL by data-level augmentation with proximate, high-value exploration.

Abstract

Offline reinforcement learning (RL) optimizes policies from a previously collected static dataset and is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which yields realistic policies and mitigates bias from out-of-distribution actions, but can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting high-value regions suggested by the critic, especially in later training when imitation is already dominant. We formally analyze this limitation by investigating the convergence properties of BC-regularized actor-critic optimization and verify it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), a plug-and-play training sample replacer that progressively replaces low-value actions with high-value actions generated by a stable actor, broadening the action exploration space while reducing the impact of low-value data. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches state-of-the-art performance when combined with the basic TD3+BC.
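
The replacement step described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function name `proximal_action_replacement`, the distance threshold `max_dist`, the linear replacement schedule, and the toy `actor` and `critic` callables are all hypothetical, and the gate shown here (critic comparison plus a distance check) is only a stand-in for the critic-based reliability gate mentioned in the TL;DR.

```python
import numpy as np

def proximal_action_replacement(states, data_actions, actor, critic,
                                step, total_steps, max_dist=0.5, rng=None):
    """Return BC targets in which some dataset actions are swapped for actor actions.

    Every rule below is an assumption made for illustration only.
    """
    rng = np.random.default_rng() if rng is None else rng
    actor_actions = actor(states)              # proposals from a (stable) actor
    q_data = critic(states, data_actions)      # value of the dataset actions
    q_actor = critic(states, actor_actions)    # value of the proposed actions

    # Assumed gate 1: only replace when the critic rates the proposal higher.
    higher_value = q_actor > q_data
    # Assumed gate 2: only replace when the proposal stays close to the data action.
    dist = np.linalg.norm(actor_actions - data_actions, axis=-1)
    proximal = dist <= max_dist
    # Assumed schedule: replacement probability grows linearly during training.
    replace_prob = min(1.0, step / total_steps)
    scheduled = rng.random(len(states)) < replace_prob

    mask = higher_value & proximal & scheduled
    targets = np.where(mask[:, None], actor_actions, data_actions)
    return targets, mask

# Toy usage: a 1-D action space whose best action is 1.0.
toy_actor = lambda s: np.full((len(s), 1), 0.8)
toy_critic = lambda s, a: -np.sum((a - 1.0) ** 2, axis=-1)
states = np.zeros((4, 1))
data_actions = np.array([[0.0], [0.5], [0.9], [1.0]])
targets, mask = proximal_action_replacement(
    states, data_actions, toy_actor, toy_critic, step=100, total_steps=100)
```

In this toy batch only the second sample is replaced: the first dataset action is lower-valued but too far from the proposal, and the last two already have higher value than the proposal.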

Paper Structure

This paper contains 17 sections, 4 theorems, 23 equations, 4 figures, 4 tables, 1 algorithm.

Key Result

Theorem 3.1

(Sub-optimality of BC Regularization) Let $(\hat{\pi}, \hat{Q})$ be the computed optimal pair for the regularized objective $J_{\theta}(\pi_{\theta}, Q) = \mathbb{E}[\lambda Q(s, \pi_\theta) - \mathcal{L}_{\text{MSE}}(\pi_{\theta}, a_{\text{data}})]$ and the Bellman error. Let $(\pi^*, Q^*)$ denote ... Its proof is provided in Appendix proof_of_3_1.
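
To make the ceiling concrete, the sketch below maximizes the regularized objective on a one-dimensional bandit. It is an illustration under assumptions, not the paper's experiment: the quadratic value function $Q(a) = -(a - a^*)^2$, the single dataset action `a_data`, and the helper names are all hypothetical. With these choices the objective $\lambda Q(a) - (a - a_{\text{data}})^2$ has the closed-form maximizer $(\lambda a^* + a_{\text{data}}) / (\lambda + 1)$, which lies strictly between the dataset action and the optimal action for any finite $\lambda > 0$.

```python
import numpy as np

def bc_regularized_optimum(a_star, a_data, lam):
    """Closed-form maximizer of lam * Q(a) - (a - a_data)**2 with Q(a) = -(a - a_star)**2."""
    return (lam * a_star + a_data) / (lam + 1.0)

def gradient_ascent(a_star, a_data, lam, lr=0.05, steps=2000):
    """Maximize the same objective numerically, starting from the dataset action."""
    a = a_data
    for _ in range(steps):
        # d/da [ -lam * (a - a_star)**2 - (a - a_data)**2 ]
        grad = -2.0 * lam * (a - a_star) - 2.0 * (a - a_data)
        a += lr * grad
    return a

# Assumed toy values: optimal action 1.0, suboptimal dataset action 0.0.
a_star, a_data, lam = 1.0, 0.0, 2.5
a_hat = gradient_ascent(a_star, a_data, lam)
a_closed = bc_regularized_optimum(a_star, a_data, lam)
gap = a_star - a_hat  # remaining distance to the optimal action
```

The numerical solution agrees with the closed form, and the gap to the optimal action stays positive, which is the behavior the theorem describes for suboptimal dataset actions.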

Figures (4)

  • Figure 1: Effect comparison of PAR. Left: Normalized score comparison during the training process between TD3+BC and TD3+BC+PAR on Walker2d-Medium-Replay. Right: Average normalized score improvement of PAR across classic and advanced behavior cloning actor-critic methods in MuJoCo.
  • Figure 2: Offline RL experiments on a simple bandit task. The backbone algorithm is TD3+BC, where BC is set to three forms: MSE, KL, and MLE. The first row shows that behavior cloning loss prevents the learned policy from moving closer to the best policy. The second row shows that PAR can significantly alleviate this situation, making the learned policy closer to the optimal policy.
  • Figure 3: Illustration of the relationship between policy divergence and critic loss on Walker2d-Medium-Replay. The figure demonstrates that as the learned policy $\pi_{\theta}$ deviates from the behavior policy $\pi_{\beta}$, the critic's training loss increases, validating Theorem \ref{theo:pao}. This motivates the need for proximal constraints in PAR to ensure stable target approximation while allowing progressive policy improvement.
  • Figure 4: Hyperparameter ablation study on three representative Hopper tasks using TD3+BC as the basic model.
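
The three BC forms named in the Figure 2 caption (MSE, KL, and MLE) can be written out for a one-dimensional Gaussian policy as in the sketch below. The exact definitions used in the paper are not given in this summary, so the Gaussian parameterization, the direction of the KL term, the fitted behavior policy parameters `mu_b` and `sigma_b`, and the function names are all assumptions.

```python
import numpy as np

def bc_mse(mu, a_data):
    """MSE form: squared distance between the policy mean and the dataset action."""
    return np.mean((mu - a_data) ** 2)

def bc_mle(mu, sigma, a_data):
    """MLE form: negative log-likelihood of the dataset action under N(mu, sigma^2)."""
    return np.mean(0.5 * np.log(2.0 * np.pi * sigma ** 2)
                   + (a_data - mu) ** 2 / (2.0 * sigma ** 2))

def bc_kl(mu, sigma, mu_b, sigma_b):
    """KL form (assumed direction): KL(N(mu, sigma^2) || N(mu_b, sigma_b^2))."""
    return np.mean(np.log(sigma_b / sigma)
                   + (sigma ** 2 + (mu - mu_b) ** 2) / (2.0 * sigma_b ** 2)
                   - 0.5)

# Toy values: policy mean 0.5 against a dataset action (or behavior mean) of 0.0.
mu, sigma = np.array([0.5]), np.array([1.0])
a_data = np.array([0.0])
losses = {
    "mse": bc_mse(mu, a_data),
    "mle": bc_mle(mu, sigma, a_data),
    "kl": bc_kl(mu, sigma, mu_b=np.array([0.0]), sigma_b=np.array([1.0])),
}
```

All three terms grow as the policy moves away from the data, which is why each of them can hold the learned policy back when the dataset actions are suboptimal.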

Theorems & Definitions (11)

  • Definition 2.1
  • Definition 2.2
  • Definition 2.3
  • Theorem 3.1
  • Theorem 3.2
  • proof
  • Theorem 2.1
  • proof
  • Theorem 2.2
  • proof
  • ...and 1 more