Unsupervised-to-Online Reinforcement Learning

Junsu Kim, Seohong Park, Sergey Levine

TL;DR

We propose unsupervised-to-online RL (U2O RL), which replaces domain-specific supervised offline RL with unsupervised offline RL as a better alternative to offline-to-online RL. U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches.

Abstract

Offline-to-online reinforcement learning (RL), a framework that trains a policy with offline RL and then further fine-tunes it with online RL, has been considered a promising recipe for data-driven decision-making. While sensible, this framework has drawbacks: it requires domain-specific offline RL pre-training for each task, and is often brittle in practice. In this work, we propose unsupervised-to-online RL (U2O RL), which replaces domain-specific supervised offline RL with unsupervised offline RL, as a better alternative to offline-to-online RL. U2O RL not only enables reusing a single pre-trained model for multiple downstream tasks, but also learns better representations, which often result in even better performance and stability than supervised offline-to-online RL. To instantiate U2O RL in practice, we propose a general recipe for U2O RL to bridge task-agnostic unsupervised offline skill-based policy pre-training and supervised online fine-tuning. Throughout our experiments in nine state-based and pixel-based environments, we empirically demonstrate that U2O RL achieves strong performance that matches or even outperforms previous offline-to-online RL approaches, while being able to reuse a single pre-trained model for a number of different downstream tasks.

Paper Structure (27 sections, 5 equations, 15 figures, 4 tables, 1 algorithm)

Figures (15)

  • Figure 1: Illustration of U2O RL. In this work, we propose to replace supervised offline RL with unsupervised offline RL in the offline-to-online RL framework. We call this scheme unsupervised-to-online RL (U2O RL). U2O RL consists of three stages: (1) unsupervised offline RL pre-training, (2) bridging, and (3) online RL fine-tuning. In unsupervised offline RL pre-training, we train a multi-task skill policy $\pi_{\theta}(a \mid s, z)$ instead of a single-task policy $\pi_{\theta}(a \mid s)$. Then, we convert the multi-task policy into a task-specific policy in the bridging phase. Finally, we fine-tune the skill policy with online environment interactions.
  • Figure 2: Environments. We evaluate U2O RL on nine state-based or pixel-based environments.
  • Figure 3: Online fine-tuning plots of U2O RL and previous offline-to-online RL frameworks (8 seeds). Across the benchmarks, U2O RL generally performs better than standard offline-to-online RL and off-policy online RL with offline data.
  • Figure 4: Learning curves during online RL fine-tuning (8 seeds). A single pre-trained model from U2O can be fine-tuned to solve multiple downstream tasks. Across the embodiments and tasks, our U2O RL matches or outperforms standard offline-to-online RL and off-policy online RL with offline data even though U2O RL uses a single task-agnostic pre-trained model.
  • Figure 5: Feature dot products during offline RL pre-training (lower is better, 8 seeds). The plots show that unsupervised offline pre-training effectively prevents feature collapse (co-adaptation), yielding better representations than supervised offline pre-training.
  • ...and 10 more figures
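
The feature dot product reported in Figure 5 can be illustrated with a small sketch. The caption does not define the metric, so this assumes it is the average dot product between feature vectors of paired inputs (for example, consecutive state-action pairs), with large values indicating co-adapted, collapsed features. The function name `mean_feature_dot_product` and the toy feature vectors are hypothetical and are not taken from the paper.

```python
from typing import Sequence


def mean_feature_dot_product(
    phi_curr: Sequence[Sequence[float]],
    phi_next: Sequence[Sequence[float]],
) -> float:
    """Average dot product between paired feature vectors.

    Assumption: phi_curr[i] and phi_next[i] are the features of two
    related inputs (e.g. consecutive state-action pairs). A lower
    average suggests less co-adaptation between their features.
    """
    if len(phi_curr) == 0 or len(phi_curr) != len(phi_next):
        raise ValueError("need two non-empty batches of equal size")
    total = 0.0
    for u, v in zip(phi_curr, phi_next):
        if len(u) != len(v):
            raise ValueError("paired feature vectors must share a dimension")
        # Dot product of one feature pair.
        total += sum(a * b for a, b in zip(u, v))
    return total / len(phi_curr)


# Toy data (hypothetical): collapsed features all point the same way,
# while diverse features are orthogonal across each pair.
collapsed = mean_feature_dot_product([[1.0, 1.0], [1.0, 1.0]],
                                     [[1.0, 1.0], [1.0, 1.0]])
diverse = mean_feature_dot_product([[1.0, 0.0], [0.0, 1.0]],
                                   [[0.0, 1.0], [1.0, 0.0]])
```

In this toy setting the collapsed features give a larger value than the diverse ones, which matches the "lower is better" reading in the caption.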