A Simple Unified Uncertainty-Guided Framework for Offline-to-Online Reinforcement Learning

Siyuan Guo; Yanchao Sun; Jifeng Hu; Sili Huang; Hechang Chen; Haiyin Piao; Lichao Sun; Yi Chang

TL;DR

The paper tackles the challenge of improving pretrained offline RL agents through online finetuning by addressing both constrained exploration and distribution shift. It introduces SUNG, a simple, unified framework that uses a VAE-based state-action visitation density to quantify uncertainty and guide both optimistic exploration and adaptive exploitation, integrated via an offline-to-online replay buffer. Empirically, SUNG improves online finetuning performance across multiple offline RL backbones (e.g., TD3+BC, CQL) on D4RL MuJoCo and AntMaze tasks and demonstrates robustness to hyperparameters and compatibility with other RL techniques. The work provides practical guidance for combining uncertainty estimation with offline-to-online learning, contributing a versatile approach to sample-efficient finetuning in offline-to-online RL.

Abstract

Offline reinforcement learning (RL) provides a promising way to learn an agent purely from data. However, constrained by the limited quality of the offline dataset, its performance is often sub-optimal. It is therefore desirable to further finetune the agent via extra online interactions before deployment. Unfortunately, offline-to-online RL is difficult because of two main challenges: constrained exploratory behavior and state-action distribution shift. In view of this, we propose a Simple Unified uNcertainty-Guided (SUNG) framework, which naturally unifies the solution to both challenges with the tool of uncertainty. Specifically, SUNG quantifies uncertainty via a VAE-based state-action visitation density estimator. To facilitate efficient exploration, SUNG presents a practical optimistic exploration strategy that selects informative actions with both high value and high uncertainty. Moreover, SUNG develops an adaptive exploitation method that applies conservative offline RL objectives to high-uncertainty samples and standard online RL objectives to low-uncertainty samples, smoothly bridging the offline and online stages. SUNG achieves state-of-the-art online finetuning performance when combined with different offline RL methods, across various environments and datasets in the D4RL benchmark. Code is publicly available at https://github.com/guosyjlu/SUNG.
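The two uncertainty-guided components described in the abstract can be illustrated with a small sketch. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the ordering of the two selection levels (value first, then uncertainty), the fixed `ood_fraction` threshold, and the per-sample loss lists are all hypothetical. In SUNG the uncertainty score comes from a VAE-based density estimator; here it is replaced by arbitrary callables and precomputed scores so the example stays self-contained.

```python
from typing import Callable, List, Sequence, Tuple

Action = Tuple[float, ...]


def bilevel_select(
    candidates: Sequence[Action],
    q_value: Callable[[Action], float],
    uncertainty: Callable[[Action], float],
    k: int,
) -> Action:
    """Pick an action that has both high value and high uncertainty.

    Level 1 keeps the k candidates with the largest Q-value (the finalists).
    Level 2 returns the finalist with the largest uncertainty.
    The value-then-uncertainty ordering is an assumption of this sketch.
    """
    if not 1 <= k <= len(candidates):
        raise ValueError("k must be between 1 and the number of candidates")
    finalists = sorted(candidates, key=q_value, reverse=True)[:k]
    return max(finalists, key=uncertainty)


def split_by_uncertainty(
    scores: Sequence[float], ood_fraction: float
) -> Tuple[List[int], List[int]]:
    """Split minibatch indices into (high-uncertainty, low-uncertainty).

    The floor(ood_fraction * batch_size) most uncertain samples are treated
    as out-of-distribution. A fixed fraction is a hypothetical threshold rule.
    """
    if not 0.0 <= ood_fraction <= 1.0:
        raise ValueError("ood_fraction must lie in [0, 1]")
    n_ood = int(ood_fraction * len(scores))
    order = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)
    return sorted(order[:n_ood]), sorted(order[n_ood:])


def adaptive_loss(
    scores: Sequence[float],
    conservative_losses: Sequence[float],
    online_losses: Sequence[float],
    ood_fraction: float,
) -> float:
    """Average a per-sample loss chosen by uncertainty.

    High-uncertainty samples contribute their conservative (offline-style)
    loss; low-uncertainty samples contribute their standard online loss.
    """
    high, low = split_by_uncertainty(scores, ood_fraction)
    total = sum(conservative_losses[i] for i in high)
    total += sum(online_losses[i] for i in low)
    return total / len(scores)
```

In this sketch, `k = 1` reduces `bilevel_select` to greedy value maximization, while setting `k` to the number of candidates reduces it to pure uncertainty seeking, so `k` trades off exploitation against exploration. Likewise, `ood_fraction = 0` makes `adaptive_loss` a purely online objective and `ood_fraction = 1` a purely conservative one.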

Paper Structure (44 sections, 13 equations, 16 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Offline RL
Offline-to-Online RL
Uncertainty for RL
Preliminaries
SUNG
Uncertainty Quantification with Density Estimation
Optimistic Exploration via Bi-level Action Selection
Adaptive Exploitation with OOD Sample Identification
Algorithm Implementation Details
Offline-to-Online Replay Buffer
SUNG Framework for Offline-to-Online RL
Experiments
Experimental Setup
...and 29 more sections

Figures (16)

Figure 1: Overview of SUNG. During online finetuning, we alternate between (a) optimistic exploration strategy to collect behavior data from environment and (b) adaptive exploitation method to improve the policy. We adopt VAE for state-action density estimation to quantify uncertainty.
Figure 2: D4RL environments. Above: MuJoCo tasks of halfcheetah, hopper and walker2d. Below: AntMaze tasks with u-shape, medium and large mazes.
Figure 3: Final performance of different offline-to-online RL methods with 200K environment steps. We report the mean D4RL score across 5 random seeds. hc = halfcheetah, hop = hopper, w = walker2d, m = medium, mr = medium-replay.
Figure 4: Performance difference of an ablation study of SUNG combined with TD3+BC and CQL, compared with the full algorithm. We report the mean performance difference across 5 different random seeds. Opt. = Optimistic, Unc. = Uncertainty, Adp. = Adaptive. hc = halfcheetah, hop = hopper, w = walker2d, r = random, m = medium, mr = medium-replay.
Figure 5: Comparing performance on MuJoCo domains with different finalist action set size $k$ used in the optimistic exploration strategy.
...and 11 more figures
