A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning
Siyuan Guo, Yanchao Sun, Jifeng Hu, Sili Huang, Hechang Chen, Haiyin Piao, Lichao Sun, Yi Chang
TL;DR
The paper tackles the challenge of improving pretrained offline RL agents through online finetuning by addressing both constrained exploration and distribution shift. It introduces SUNG, a simple, unified framework that uses a VAE-based state-action visitation density to quantify uncertainty and guide both optimistic exploration and adaptive exploitation, integrated via an offline-to-online replay buffer. Empirically, SUNG improves online finetuning performance across multiple offline RL backbones (e.g., TD3+BC, CQL) on D4RL MuJoCo and AntMaze tasks and demonstrates robustness to hyperparameters and compatibility with other RL techniques. The work provides practical guidance for combining uncertainty estimation with offline-to-online learning, contributing a versatile approach to sample-efficient finetuning in offline-to-online RL.
Abstract
Offline reinforcement learning (RL) provides a promising, fully data-driven approach to learning an agent. However, constrained by the limited quality of the offline dataset, its performance is often sub-optimal. It is therefore desirable to further finetune the agent via extra online interactions before deployment. Unfortunately, offline-to-online RL is difficult because of two main challenges: constrained exploratory behavior and state-action distribution shift. In view of this, we propose a Simple Unified uNcertainty-Guided (SUNG) framework, which naturally unifies the solution to both challenges with the tool of uncertainty. Specifically, SUNG quantifies uncertainty via a VAE-based state-action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy that selects informative actions with both high value and high uncertainty. Moreover, SUNG develops an adaptive exploitation method that applies conservative offline RL objectives to high-uncertainty samples and standard online RL objectives to low-uncertainty samples, smoothly bridging the offline and online stages. SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods, across various environments and datasets in the D4RL benchmark. Code is publicly available at https://github.com/guosyjlu/SUNG.
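The two uncertainty-guided steps described in the abstract can be illustrated with a minimal sketch. This is not the released SUNG implementation: it assumes that per-sample uncertainty scores are already available (for example, from a density estimator), and the function names, the `top_k` parameter, and the `conservative_fraction` parameter are hypothetical choices made for illustration. The first function shows one plausible way to prefer actions with both high value and high uncertainty; the second shows one plausible way to route high-uncertainty samples to a conservative objective and the remainder to a standard one.

```python
from typing import List, Sequence, Tuple


def select_optimistic_action(
    q_values: Sequence[float],
    uncertainties: Sequence[float],
    top_k: int,
) -> int:
    """Return the index of the candidate action to execute.

    Assumed two-stage rule (illustrative only):
      1. keep the `top_k` candidates with the highest Q-value;
      2. among those, pick the one with the highest uncertainty.
    """
    if len(q_values) != len(uncertainties):
        raise ValueError("q_values and uncertainties must have equal length")
    if not 1 <= top_k <= len(q_values):
        raise ValueError("top_k must be between 1 and the number of candidates")
    # Candidate indices ordered from highest to lowest estimated value.
    by_value = sorted(range(len(q_values)), key=lambda i: q_values[i], reverse=True)
    finalists = by_value[:top_k]
    return max(finalists, key=lambda i: uncertainties[i])


def split_batch_by_uncertainty(
    uncertainties: Sequence[float],
    conservative_fraction: float,
) -> Tuple[List[int], List[int]]:
    """Split batch indices into (conservative, standard) groups.

    The most uncertain `conservative_fraction` of the batch (rounded down)
    would be trained with a conservative offline objective; the remaining
    samples would be trained with a standard online objective.
    """
    if not 0.0 <= conservative_fraction <= 1.0:
        raise ValueError("conservative_fraction must be in [0, 1]")
    n_conservative = int(conservative_fraction * len(uncertainties))
    # Sample indices ordered from most to least uncertain.
    by_uncertainty = sorted(
        range(len(uncertainties)), key=lambda i: uncertainties[i], reverse=True
    )
    conservative = sorted(by_uncertainty[:n_conservative])
    standard = sorted(by_uncertainty[n_conservative:])
    return conservative, standard
```

In this sketch, `top_k` trades off greediness against exploration: with `top_k = 1` the rule reduces to picking the highest-value action, while larger values give uncertainty more influence over the choice.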
