Improving Offline-to-Online Reinforcement Learning with Q Conditioned State Entropy Exploration

Ziqi Zhang; Xiao Xiong; Zifeng Zhuang; Jinxin Liu; Donglin Wang

TL;DR

This work tackles the distribution shift that harms offline-to-online reinforcement learning by introducing QCSE, a Q-conditioned state entropy intrinsic reward that promotes diverse state exploration conditioned on Q-values and thereby implicitly achieves State Marginal Matching (SMM). The authors give theoretical arguments that QCSE preserves monotonic Soft-Q optimization and, under a finite action set, converges toward an optimal policy, while conditioning on Q-values protects transitions from being disrupted by entropy maximization. Empirically, QCSE improves online fine-tuning performance by about 13% for CQL and 8% for Cal-QL across Gym-Mujoco and Antmaze tasks, and generalizes to other model-free algorithms. The approach is a plug-and-play reward augmentation that is reported to be robust across hyperparameters and task domains.

Abstract

Fine-tuning policies pre-trained with offline reinforcement learning (RL) is important for improving the sample efficiency of RL algorithms. However, directly fine-tuning a pre-trained policy often yields sub-optimal performance, primarily because of the distribution shift between the offline pre-training and online fine-tuning stages. Specifically, this shift limits the acquisition of effective online samples, which in turn degrades online fine-tuning performance. To narrow the distribution shift between the offline and online stages, we propose Q conditioned state entropy (QCSE) as an intrinsic reward. QCSE maximizes the state entropy of each sample individually, conditioned on its Q value. This encourages exploration of low-frequency samples while penalizing high-frequency ones, and implicitly achieves State Marginal Matching (SMM), thereby ensuring optimal performance and addressing the asymptotic sub-optimality of constraint-based approaches. Additionally, QCSE integrates seamlessly into various RL algorithms, enhancing their online fine-tuning performance. To validate these claims, we conduct extensive experiments and observe significant improvements with QCSE (about 13% for CQL and 8% for Cal-QL). We further extend the experiments to other algorithms, affirming the generality of QCSE.
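To make the idea of a Q-conditioned state entropy bonus more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a k-nearest-neighbour style bonus computed in the joint (state, Q) space with a max-norm, so that a sample is treated as rare only relative to samples with similar Q values. The function names `q_conditioned_entropy_bonus` and `augment_rewards`, the `log(1 + d)` form, and the weight `lam` are all hypothetical choices made for illustration.

```python
import numpy as np


def q_conditioned_entropy_bonus(states, q_values, k=3):
    """Hypothetical k-NN exploration bonus over states, conditioned on Q.

    Distances are taken in the joint (state, Q) space with a max-norm, so two
    samples count as neighbours only if they are close in state AND have
    similar Q values. This is an illustrative assumption, not the paper's
    estimator.
    """
    states = np.asarray(states, dtype=np.float64)
    q = np.asarray(q_values, dtype=np.float64).reshape(-1)
    n = states.shape[0]
    if not 1 <= k < n:
        raise ValueError("k must satisfy 1 <= k < number of samples")
    # Pairwise Euclidean distances between states, shape (n, n).
    d_state = np.linalg.norm(states[:, None, :] - states[None, :, :], axis=-1)
    # Pairwise absolute differences between Q values, shape (n, n).
    d_q = np.abs(q[:, None] - q[None, :])
    d_joint = np.maximum(d_state, d_q)
    # Column 0 of each sorted row is the self-distance (0), so column k is
    # the distance to the k-th nearest neighbour.
    kth = np.sort(d_joint, axis=1)[:, k]
    # log(1 + d) keeps the bonus non-negative; rare samples get larger values.
    return np.log(kth + 1.0)


def augment_rewards(rewards, bonus, lam=0.1):
    """Plug-in augmentation: extrinsic reward plus a weighted intrinsic bonus."""
    return np.asarray(rewards, dtype=np.float64) + lam * bonus
```

Under these assumptions, a sample that lies far from the rest of the batch, or whose Q value differs from those of nearby states, receives a larger bonus than samples in a dense cluster, which matches the intent of rewarding low-frequency samples and discounting high-frequency ones.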

Paper Structure (54 sections, 5 theorems, 19 equations, 11 figures, 9 tables, 1 algorithm)


Introduction
Related Work
Offline RL.
Offline-to-Online RL.
Online Exploration.
Preliminary
Reinforcement Learning (RL).
Model-free Offline RL.
Drawbacks of previous offline-to-online algorithms.
Q conditioned state entropy maximization (QCSE)
State entropy maximization implicitly realizes SMM during the online fine-tuning stage.
Implementation of QCSE.
Advantages of QCSE
Monotonicity of QCSE.
Q condition protects transitions from being disrupted by entropy maximization.
...and 39 more sections

Key Result

Theorem 4

Repeatedly applying Lemma 5 (soft policy evaluation) and Lemma 6 (soft policy improvement) to any $\pi \in \Pi$ leads to convergence towards a policy $\pi^*$, and it can be proved that $Q^{\pi^*}\left(\mathbf{s}_t, \mathbf{a}_t\right) \geq Q^\pi\left(\mathbf{s}_t, \mathbf{a}_t\right)$ for all policies $\pi \in \Pi$ and all state-action pairs …
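The alternation of soft policy evaluation and soft policy improvement in the statement above can be illustrated on a small tabular problem. The sketch below is not the paper's algorithm: it assumes a finite MDP with known dynamics and treats the intrinsic bonus as a fixed term already folded into the reward table `R`, which is a simplification. The function names, the temperature `alpha`, and the discount `gamma` are hypothetical. With exact evaluation, each improvement step should not decrease any soft Q value.

```python
import numpy as np


def soft_policy_evaluation(P, R, pi, gamma=0.9, alpha=0.5, iters=500):
    """Iterate the soft Bellman backup for a fixed policy.

    P: (S, A, S) transition probabilities, R: (S, A) rewards (assumed to
    already include any intrinsic bonus), pi: (S, A) strictly positive policy.
    """
    Q = np.zeros_like(R)
    for _ in range(iters):
        # Soft state value: expected Q plus the entropy term of the policy.
        V = np.sum(pi * (Q - alpha * np.log(pi)), axis=1)
        Q = R + gamma * (P @ V)
    return Q


def soft_policy_improvement(Q, alpha=0.5):
    """New policy proportional to exp(Q / alpha), computed stably."""
    z = (Q - Q.max(axis=1, keepdims=True)) / alpha
    pi = np.exp(z)
    return pi / pi.sum(axis=1, keepdims=True)


def run_soft_policy_iteration(P, R, rounds=10, gamma=0.9, alpha=0.5):
    """Alternate evaluation and improvement, returning the Q table per round."""
    n_states, n_actions = R.shape
    pi = np.full((n_states, n_actions), 1.0 / n_actions)  # uniform start
    history = []
    for _ in range(rounds):
        Q = soft_policy_evaluation(P, R, pi, gamma, alpha)
        history.append(Q)
        pi = soft_policy_improvement(Q, alpha)
    return history
```

Comparing consecutive entries of the returned list element-wise gives a quick numerical check of monotonic improvement on a random MDP.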

Figures (11)

Figure 1: Demonstration of QCSE.
Figure 2: Q condition vs. V condition. In this experiment, we select AWAC as the base algorithm and compare using the V-network and the Q-network to compute the condition of the intrinsic reward. The results indicate that computing the condition with the Q-network leads to better overall performance for AWAC. [nair2021awac] points out that AWAC demonstrates poor online fine-tuning performance.
Figure 3: Online fine-tuning curve on selected tasks. We test QCSE by comparing Cal-QL-QCSE and CQL-QCSE with Cal-QL and CQL on selected tasks in the Gym-Mujoco and Antmaze domains, and report average return curves over multiple evaluations. As shown in this figure, QCSE improves the online fine-tuning sample efficiency of Cal-QL and CQL and achieves better performance than the baselines (CQL and Cal-QL $\textit{without}$ QCSE) $\textit{over all selected tasks}$.
Figure 4: Performance of $\textbf{Alg}$-QCSE. We test QCSE with AWAC, TD3+BC, and IQL on selected Gym-Mujoco tasks. QCSE clearly improves the performance of these algorithms on these tasks, showing its versatility.
Figure 5: Performance comparison for a variety of exploration methods. (a) Online fine-tuning performance difference between SAC and SAC-QCSE. (b) Online fine-tuning performance difference between various exploration methods with IQL and AWAC.
...and 6 more figures

Theorems & Definitions (8)

Definition 1: Marginal State distribution
Definition 2: State Marginal Matching
Definition 3: Critic Conditioned State Entropy
Theorem 4: Converged QCSE Soft Policy is Optimal
Lemma 5: Soft Policy Evaluation with QCSE.
Lemma 6: Soft Policy Improvement with QCSE
Theorem 7: Converged QCSE Soft Policy is Optimal
Theorem 8: Conservative Soft Q values with QCSE
