Offline Meta-Reinforcement Learning with Online Self-Supervision

Vitchyr H. Pong, Ashvin Nair, Laura Smith, Catherine Huang, Sergey Levine

TL;DR

SMAC tackles offline meta-reinforcement learning by diagnosing a distribution shift in the adaptation context ${\bf z}$ that arises when a meta-policy is trained on offline data. It introduces a semi-supervised framework that first performs offline meta-training and then gathers unlabeled online data, labeling these new transitions with synthetic rewards from a learned reward decoder to bridge the shift. The method combines PEARL-style context encoding with AWAC-based offline updates and a self-supervised online phase, yielding substantial improvements across diverse multi-task robotics domains and often matching fully online meta-RL. Because the online data needs no reward labels, the approach lowers reward-labeling costs while preserving adaptation performance, making offline-to-online meta-learning more practical for real-world tasks.
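
The self-supervised labeling step in the summary above can be illustrated with a minimal sketch. It is not the authors' implementation: `toy_reward_decoder`, `sample_prior`, and `relabel_with_decoder` are hypothetical names, the decoder is a hand-written stand-in for a learned network, and a standard normal prior over ${\bf z}$ is assumed.

```python
import random

def sample_prior(dim, rng):
    # Assumption: the prior p(z) is a standard normal over the context vector.
    return [rng.gauss(0.0, 1.0) for _ in range(dim)]

def toy_reward_decoder(state, action, z):
    # Hypothetical stand-in for a learned reward decoder r_hat(s, a, z):
    # negative squared distance between the state and a "goal" encoded by z.
    return -sum((s_i - z_i) ** 2 for s_i, z_i in zip(state, z))

def relabel_with_decoder(transitions, z, reward_decoder):
    # Turn unlabeled (s, a, s') tuples into (s, a, r_hat, s') tuples.
    return [(s, a, reward_decoder(s, a, z), s_next)
            for (s, a, s_next) in transitions]

rng = random.Random(0)
z = sample_prior(2, rng)

# Unlabeled online transitions, as if collected by a policy conditioned on z.
unlabeled = [
    ([0.0, 0.0], [1.0], [0.1, 0.0]),
    ([0.1, 0.0], [1.0], [0.2, 0.0]),
]

# The synthetic rewards let these transitions join the training buffer
# without any external reward supervision.
labeled = relabel_with_decoder(unlabeled, z, toy_reward_decoder)
```

In this sketch the context `z` used for exploration is also the one passed to the decoder, so the generated rewards are consistent with the task the policy was conditioned on.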

Abstract

Meta-reinforcement learning (RL) methods can meta-train policies that adapt to new tasks with orders of magnitude less data than standard RL, but meta-training itself is costly and time-consuming. If we can meta-train on offline data, then we can reuse the same static dataset, labeled once with rewards for different tasks, to meta-train policies that adapt to a variety of new tasks at meta-test time. Although this capability would make meta-RL a practical tool for real-world use, offline meta-RL presents additional challenges beyond online meta-RL or standard offline RL settings. Meta-RL learns an exploration strategy that collects data for adapting, and also meta-trains a policy that quickly adapts to data from a new task. Since this policy was meta-trained on a fixed, offline dataset, it might behave unpredictably when adapting to data collected by the learned exploration strategy, which differs systematically from the offline data and thus induces distributional shift. We propose a hybrid offline meta-RL algorithm, which uses offline data with rewards to meta-train an adaptive policy, and then collects additional unsupervised online data, without any reward labels, to bridge this distribution shift. Because online collection requires no reward labels, this data can be much cheaper to collect. We compare our method to prior work on offline meta-RL on simulated robot locomotion and manipulation tasks and find that using additional unsupervised online data collection leads to a dramatic improvement in the adaptive capabilities of the meta-trained policies, matching the performance of fully online meta-RL on a range of challenging domains that require generalization to new tasks.

Paper Structure

This paper contains 41 sections, 8 equations, 10 figures, 2 tables, 1 algorithm.

Figures (10)

  • Figure 1: (left) In offline meta-RL, an agent uses offline data from multiple tasks $T_1, T_2, \dots$, each with reward labels that need to be provided only once. (middle) In online meta-RL, new reward supervision must be provided with every environment interaction. (right) In semi-supervised meta-RL, an agent uses an offline dataset collected once to learn to generate its own reward labels for new, online interactions. As in offline meta-RL, reward labels need to be provided only once, for the offline training; unlike online meta-RL, the additional environment interactions require neither external reward supervision nor additional task sampling.
  • Figure 2: Left: The distribution of the KL-divergence between the posterior $q_{\phi_e}({\mathbf z} \mid {\mathbf h})$ and the prior $p({\mathbf z})$ over the course of meta-training, when ${\mathbf h}$ is sampled from offline data (blue) or online data generated by the learned policy (orange). Adapting to online data results in posteriors that are substantially farther from the prior, suggesting a significant difference in distribution over ${\mathbf z}$. Right: The performance of the policy after adapting to data from the offline dataset (blue) or the learned policy (orange). Since the same policy is evaluated, the performance drop when conditioned on the online data is likely due to the change in ${\mathbf z}$-distribution.
  • Figure 3: (Left) In the offline phase, we sample a history ${\mathbf h}'$ to compute the posterior $q_{\phi_e}({\mathbf z} \mid {\mathbf h}')$. We then use a sample from this encoder and another history batch ${\mathbf h}$ to train the networks. In red, we then update the networks with ${\mathbf h}$ and the ${\mathbf z}$ sample. (Right) During the self-supervised phase, we explore by sampling ${\mathbf z} \sim p({\mathbf z})$ and conditioning our policy on these observations. We label rewards using our learned reward decoder, and append the resulting data to the training data. The training procedure is equivalent to the offline phase, except that we do not train the reward decoder or encoder since no additional ground-truth rewards are observed.
  • Figure 4: We propose a new meta-learning evaluation domain based on the environment from [khazatsky2021val], in which a simulated Sawyer gripper can perform various manipulation tasks such as pushing a button, opening drawers, and picking and placing objects. We show a subset of meta-training (blue) and meta-test (orange) tasks. Each task contains a unique object configuration, and we test the agent on held-out tasks.
  • Figure 5: We report the final return of meta-test adaptation on unseen test tasks versus the amount of self-supervised meta-training following offline meta-training. Our method SMAC, shown in red, consistently trains to a reasonable performance from offline meta-RL (shown at step 0) and then steadily improves with online self-supervised experience. The offline meta-RL methods, MACAW [mitchell2021offline] and BOReL, are competitive with the offline performance of SMAC but have no mechanism to improve via self-supervision. We also compare to SMAC (SAC ablation), which uses SAC instead of AWAC as the underlying RL algorithm. This ablation struggles to train a value function offline and therefore struggles to improve on more difficult tasks.
  • ...and 5 more figures
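
The quantity plotted in Figure 2 (left), the KL-divergence between the posterior $q_{\phi_e}({\mathbf z} \mid {\mathbf h})$ and the prior $p({\mathbf z})$, has a closed form under one common modeling choice. The sketch below assumes a diagonal Gaussian posterior and a standard normal prior; the captions do not state the distribution families, so this is an illustrative assumption, and the function name is hypothetical.

```python
import math

def kl_diag_gaussian_to_standard_normal(mean, var):
    # KL( N(mean, diag(var)) || N(0, I) )
    #   = 0.5 * sum_i ( var_i + mean_i^2 - 1 - log(var_i) )
    # Assumption: diagonal Gaussian posterior, standard normal prior.
    return 0.5 * sum(v + m * m - 1.0 - math.log(v) for m, v in zip(mean, var))

# A posterior equal to the prior has zero divergence.
kl_at_prior = kl_diag_gaussian_to_standard_normal([0.0, 0.0], [1.0, 1.0])

# Shifting one mean coordinate by 1 adds 0.5 * 1^2 = 0.5 nats.
kl_shifted = kl_diag_gaussian_to_standard_normal([1.0, 0.0], [1.0, 1.0])

# A confident (low-variance) posterior far from the origin is much
# farther from the prior, the kind of gap the caption describes for online data.
kl_far = kl_diag_gaussian_to_standard_normal([3.0, -3.0], [0.1, 0.1])
```

Under these assumptions, a larger value means the inferred context lies in a region of ${\mathbf z}$-space that a prior-matched posterior would rarely produce.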