Unsupervised-to-Online Reinforcement Learning

Junsu Kim; Seohong Park; Sergey Levine

TL;DR

Unsupervised-to-online RL (U2O RL) replaces the domain-specific supervised offline RL stage of offline-to-online RL with unsupervised offline RL. It achieves strong performance that matches or even outperforms previous offline-to-online RL approaches.

Abstract

Offline-to-online reinforcement learning (RL), a framework that trains a policy with offline RL and then further fine-tunes it with online RL, has been considered a promising recipe for data-driven decision-making. While sensible, this framework has drawbacks: it requires domain-specific offline RL pre-training for each task, and is often brittle in practice. In this work, we propose unsupervised-to-online RL (U2O RL), which replaces domain-specific supervised offline RL with unsupervised offline RL, as a better alternative to offline-to-online RL. U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations, which often result in even better performance and stability than supervised offline-to-online RL. To instantiate U2O RL in practice, we propose a general recipe for U2O RL to bridge task-agnostic unsupervised offline skill-based policy pre-training and supervised online fine-tuning. Throughout our experiments in nine state-based and pixel-based environments, we empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches, while being able to reuse a single pre-trained model for a number of different downstream tasks.
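
As a rough illustration of the bridging idea mentioned above (turning a task-agnostic skill policy into a single-task policy before fine-tuning), the following toy sketch may help. It is not the paper's implementation: the linear `skill_policy`, the least-squares `identify_skill` step, the random features, and all dimensions are hypothetical choices made only for this example. The one thing it is meant to show is that once a skill vector is chosen for a task, fixing it yields an ordinary policy over states.

```python
import numpy as np

def skill_policy(theta, s, z):
    """Toy stand-in for a skill policy pi_theta(a | s, z): linear in [s, z], squashed by tanh."""
    return np.tanh(theta @ np.concatenate([s, z]))

def identify_skill(features, rewards):
    """Hypothetical bridging step (an assumption, not taken from the text above):
    fit rewards as a linear function of per-transition features, then normalize."""
    z, *_ = np.linalg.lstsq(features, rewards, rcond=None)
    return z / (np.linalg.norm(z) + 1e-8)

def bridge(theta, z_star):
    """Fix the skill vector so the multi-task policy becomes a single-task policy pi(a | s)."""
    return lambda s: skill_policy(theta, s, z_star)

rng = np.random.default_rng(0)
state_dim, skill_dim, action_dim = 4, 3, 2            # illustrative sizes only
theta = rng.normal(size=(action_dim, state_dim + skill_dim))

# Synthetic reward-labeled data: rewards are exactly linear in the features.
features = rng.normal(size=(256, skill_dim))
true_z = np.array([1.0, 0.0, 0.0])
rewards = features @ true_z

z_star = identify_skill(features, rewards)
task_policy = bridge(theta, z_star)                    # this is what online RL would then fine-tune
action = task_policy(rng.normal(size=state_dim))
```

In this sketch the same `theta` could be bridged to several tasks simply by calling `identify_skill` with different reward labels, which mirrors the reuse of a single pre-trained model described in the abstract.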

Paper Structure (27 sections, 5 equations, 15 figures, 4 tables, 1 algorithm)

Introduction
Related work
Preliminaries
Unsupervised-to-online RL (U2O RL)
Unsupervised offline policy pre-training
Bridging offline unsupervised RL and online supervised RL
Online fine-tuning
Why is U2O RL potentially better than offline-to-online RL?
Experiments
Experimental setup
Is U2O RL better than previous offline-to-online RL frameworks?
How does U2O RL compare to previous specialized offline-to-online RL techniques?
Can a single pre-trained model from U2O be fine-tuned to solve multiple tasks?
Why does U2O RL often outperform supervised offline-to-online RL?
Is fine-tuning better than other alternative strategies (e.g., hierarchical RL)?
...and 12 more sections

Figures (15)

Figure 1: Illustration of U2O RL. In this work, we propose to replace supervised offline RL with unsupervised offline RL in the offline-to-online RL framework. We call this scheme unsupervised-to-online RL (U2O RL). U2O RL consists of three stages: (1) unsupervised offline RL pre-training, (2) bridging, and (3) online RL fine-tuning. In unsupervised offline RL pre-training, we train a multi-task skill policy $\pi_{\theta}(a \mid s, z)$ instead of a single-task policy $\pi_{\theta}(a \mid s)$. Then, we convert the multi-task policy into a task-specific policy in the bridging phase. Finally, we fine-tune the skill policy with online environment interactions.
Figure 2: Environments. We evaluate U2O RL on nine state-based or pixel-based environments.
Figure 3: Online fine-tuning plots of U2O RL and previous offline-to-online RL frameworks (8 seeds). Across most benchmarks, U2O RL performs better than standard offline-to-online RL and off-policy online RL with offline data.
Figure 4: Learning curves during online RL fine-tuning (8 seeds). A single pre-trained model from U2O can be fine-tuned to solve multiple downstream tasks. Across the embodiments and tasks, our U2O RL matches or outperforms standard offline-to-online RL and off-policy online RL with offline data even though U2O RL uses a single task-agnostic pre-trained model.
Figure 5: Feature dot products during offline RL pre-training (lower is better, 8 seeds). The plots show that unsupervised offline pre-training effectively prevents feature collapse (co-adaptation), yielding better representations than supervised offline pre-training.
...and 10 more figures
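
The feature dot product reported in Figure 5 can be illustrated with a small sketch. The captions do not say which network layer or which pairs of inputs are used, so the function below assumes, purely for illustration, that features of consecutive inputs are compared row by row. The function name and the toy arrays are hypothetical.

```python
import numpy as np

def mean_feature_dot_product(feats, next_feats):
    """Average row-wise dot product between two batches of feature vectors.
    Large values suggest the features of different inputs point the same way (collapse)."""
    return float(np.mean(np.sum(feats * next_feats, axis=1)))

# Diverse features: each row is paired with a different basis vector.
diverse = np.eye(3)
diverse_next = np.roll(diverse, 1, axis=0)
diverse_score = mean_feature_dot_product(diverse, diverse_next)

# Collapsed features: every input maps to the same vector.
collapsed = np.ones((3, 3))
collapsed_score = mean_feature_dot_product(collapsed, collapsed)
```

Under this assumed definition, the diverse batch scores lower than the collapsed batch, which matches the "lower is better" reading in the caption.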
