Proximal Action Replacement for Behavior Cloning Actor-Critic in Offline Reinforcement Learning
Jinzong Dong, Wei Huang, Jianshu Zhang, Zhuo Chen, Xinzhe Yuan, Qinying Gu, Zhaohui Jiang, Nanyang Ye
TL;DR
This work identifies a fundamental suboptimality ceiling in BC-regularized offline RL that arises when dataset actions are suboptimal. It formalizes the resulting convergence limitation and demonstrates it on a controlled bandit task, then introduces Proximal Action Replacement (PAR), a plug-and-play mechanism that progressively replaces low-value dataset actions with high-value actions generated by a stable actor, using a critic-based reliability gate to preserve stability. PAR is compatible with multiple BC regularizations and yields consistent performance gains across diverse offline RL benchmarks, often approaching state-of-the-art while adding minimal computational overhead. PAR thus offers a practical way to surpass the imitation ceiling in offline RL through data-level replacement with proximal, high-value actions.
Abstract
Offline reinforcement learning (RL) optimizes policies from a previously collected static dataset and is an important branch of RL. A popular and promising approach is to regularize actor-critic methods with behavior cloning (BC), which yields realistic policies and mitigates bias from out-of-distribution actions. However, BC regularization can impose an often-overlooked performance ceiling: when dataset actions are suboptimal, indiscriminate imitation structurally prevents the actor from fully exploiting high-value regions suggested by the critic, especially in later training when imitation is already dominant. We formally analyze this limitation by investigating the convergence properties of BC-regularized actor-critic optimization and verify it on a controlled continuous bandit task. To break this ceiling, we propose proximal action replacement (PAR), a plug-and-play training sample replacer that progressively replaces low-value actions with high-value actions generated by a stable actor, broadening the action exploration space while reducing the impact of low-value data. PAR is compatible with multiple BC regularization paradigms. Extensive experiments across offline RL benchmarks show that PAR consistently improves performance and approaches state-of-the-art when combined with the basic TD3+BC.
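
To make the ceiling and the replacement idea concrete, the following is a minimal illustrative sketch on a one-dimensional continuous bandit. It is not the authors' implementation or their formal analysis. It assumes a known quadratic critic peaking at a hypothetical optimum `a_star`, a TD3+BC-style actor objective `lam * Q(a) - (a - a_target)^2` solved in closed form, and a simple gate (`gate_margin`, a hypothetical parameter) that replaces a training action only when the critic rates the actor's action higher. In real offline RL the critic is learned and can be unreliable, which is why a reliability gate matters; here the critic is exact, so the gate always passes.

```python
import numpy as np

def q_value(a, a_star=1.0):
    """Toy critic for a 1-D continuous bandit (assumed): reward peaks at a_star."""
    return -(a - a_star) ** 2

def bc_regularized_optimum(a_target, lam, a_star=1.0):
    """Closed-form maximizer of lam * Q(a) - (a - a_target)**2 for the quadratic Q above.

    Setting the derivative to zero gives a = (lam * a_star + a_target) / (lam + 1),
    which always lies strictly between a_target and a_star for finite lam > 0.
    """
    return (lam * a_star + a_target) / (lam + 1.0)

def proximal_replace(a_target, a_actor, gate_margin=0.0, a_star=1.0):
    """Hypothetical gate: keep the actor's action only where the critic rates it
    higher than the current training action by more than gate_margin."""
    better = q_value(a_actor, a_star) > q_value(a_target, a_star) + gate_margin
    return np.where(better, a_actor, a_target)

lam = 1.0
a_data = np.array([0.0])   # suboptimal dataset action; the optimum is a_star = 1.0
targets = a_data.copy()    # training actions, progressively replaced
history = []
for _ in range(20):
    # Actor trained to convergence against the current training actions.
    a_actor = bc_regularized_optimum(targets, lam)
    history.append(float(a_actor[0]))
    # Replace low-value training actions with nearby higher-value actor actions.
    targets = proximal_replace(targets, a_actor)
```

Without replacement, the actor in this toy stays at the first entry of `history`, halfway between the dataset action and the optimum. With repeated replacement, each round halves the remaining gap, so the actor approaches `a_star`. The step size is bounded by the BC term, which is the sense in which the replacement stays proximal in this sketch.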
