Improving Offline-to-Online Reinforcement Learning with Q Conditioned State Entropy Exploration
Ziqi Zhang, Xiao Xiong, Zifeng Zhuang, Jinxin Liu, Donglin Wang
TL;DR
This work tackles the distribution shift that harms offline-to-online reinforcement learning by introducing QCSE, a Q-conditioned state entropy intrinsic reward that promotes diverse state exploration conditioned on Q-values, thus implicitly achieving State Marginal Matching (SMM). The authors provide theoretical arguments showing QCSE preserves monotonic Soft-Q optimization and, under a finite action set, converges toward an optimal policy, while protecting transitions by conditioning on Q-values. Empirically, QCSE improves online fine-tuning performance for CQL and Cal-QL by approximately 8–13% across Gym-Mujoco and Antmaze tasks and generalizes to other model-free algorithms. The approach offers a plug-and-play reward augmentation with broad applicability and demonstrates robustness across hyperparameters and task domains, signaling a practical advance for data-efficient offline-to-online RL.
Abstract
Fine-tuning policies pre-trained with offline reinforcement learning (RL) is important for improving the sample efficiency of RL algorithms. However, directly fine-tuning pre-trained policies often results in sub-optimal performance, primarily because of the distribution shift between the offline pre-training and online fine-tuning stages. Specifically, the distribution shift limits the acquisition of effective online samples, which ultimately degrades online fine-tuning performance. To narrow the distribution shift between the offline and online stages, we propose Q conditioned state entropy (QCSE) as an intrinsic reward. QCSE maximizes the state entropy of each sample individually, conditioned on its Q value. This approach encourages exploration of low-frequency samples while penalizing high-frequency ones, and implicitly achieves State Marginal Matching (SMM), thereby ensuring optimal performance and addressing the asymptotic sub-optimality of constraint-based approaches. Additionally, QCSE integrates seamlessly into various RL algorithms and enhances their online fine-tuning performance. To validate our claims, we conduct extensive experiments and observe significant improvements with QCSE (about 13% for CQL and 8% for Cal-QL). Furthermore, we extend the experiments to other algorithms, affirming the generality of QCSE.
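To make the idea of a Q-conditioned entropy bonus concrete, the following is a minimal sketch, not the authors' implementation. The abstract does not specify an entropy estimator, so this sketch assumes a k-nearest-neighbor particle estimate computed on joint (state, Q) features as a rough proxy for conditioning on Q. The function names, the parameters `k`, `q_scale`, and `lam`, and the `log(d + 1)` form of the bonus are all hypothetical choices made for illustration.

```python
import math


def qcse_style_bonus(states, q_values, k=2, q_scale=1.0):
    """Hypothetical k-NN bonus on joint (state, Q) features.

    Assumption: a sample is "rare" when its k-th nearest neighbor in the
    concatenated (state, q_scale * Q) space is far away. This is a proxy
    for a Q-conditioned state entropy term, not the paper's estimator.
    """
    feats = [list(s) + [q_scale * q] for s, q in zip(states, q_values)]
    n = len(feats)
    if n <= k:
        raise ValueError("need more than k samples in the batch")
    bonus = []
    for i in range(n):
        # Distances from sample i to every other sample, ascending.
        dists = sorted(math.dist(feats[i], feats[j]) for j in range(n) if j != i)
        # Larger k-th neighbor distance means lower local density.
        bonus.append(math.log(dists[k - 1] + 1.0))
    return bonus


def augment_rewards(ext_rewards, bonus, lam=0.1):
    """Add the intrinsic bonus to the task reward (lam is an assumed weight)."""
    return [r + lam * b for r, b in zip(ext_rewards, bonus)]


# Four clustered states and one outlier, all with the same Q value.
states = [[0.0, 0.0], [0.1, 0.0], [0.0, 0.1], [0.1, 0.1], [5.0, 5.0]]
q_values = [1.0, 1.0, 1.0, 1.0, 1.0]
bonus = qcse_style_bonus(states, q_values, k=2)
rewards = augment_rewards([0.0] * 5, bonus, lam=0.1)
```

In this toy batch the outlier state receives the largest bonus, which matches the abstract's description of encouraging low-frequency samples while giving little to high-frequency ones. Because Q is part of the feature vector, two samples with identical states but very different Q values are also treated as far apart.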
