METRA: Scalable Unsupervised RL with Metric-Aware Abstraction

Seohong Park, Oleh Rybkin, Sergey Levine

TL;DR

METRA discovers a variety of useful behaviors even in complex, pixel-based environments, and is the first unsupervised RL method to discover diverse locomotion behaviors in pixel-based Quadruped and Humanoid environments.

Abstract

Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite the previous attempts, making unsupervised RL truly scalable still remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call Metric-Aware Abstraction (METRA). Our main idea is, instead of directly covering the entire state space, to only cover a compact latent space $Z$ that is metrically connected to the state space $S$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, being scalable to high-dimensional environments. Through our experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, being the first unsupervised RL method that discovers diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and videos are available at https://seohong.me/projects/metra/
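One way to picture "moving in every direction in the latent space" is a per-step reward that measures how far a transition moves the latent code along a chosen skill direction. The sketch below is illustrative only: the function names, the 2-D latent, and the unit-step bound used as a stand-in for "metrically connected by temporal distances" are assumptions for this example, not the authors' implementation.

```python
import math

def latent_direction_reward(phi_s, phi_next, z):
    """Assumed reward: inner product of the latent displacement with skill z."""
    return sum((b - a) * zi for a, b, zi in zip(phi_s, phi_next, z))

def within_unit_step(phi_s, phi_next, tol=1e-9):
    """Assumed stand-in for a temporal-distance constraint: one environment
    step may move the latent code by at most distance 1."""
    dist = math.sqrt(sum((b - a) ** 2 for a, b in zip(phi_s, phi_next)))
    return dist <= 1.0 + tol

# Hypothetical latent codes of two consecutive states (2-D latent space).
phi_s = (0.0, 0.0)
phi_next = (0.6, 0.8)

# Two hypothetical skills: one aligned with the displacement, one opposed.
z_aligned = (0.6, 0.8)
z_opposite = (-0.6, -0.8)

r_aligned = latent_direction_reward(phi_s, phi_next, z_aligned)
r_opposite = latent_direction_reward(phi_s, phi_next, z_opposite)
step_ok = within_unit_step(phi_s, phi_next)
```

Under these assumptions, a policy conditioned on `z_aligned` is rewarded for this transition and one conditioned on `z_opposite` is penalized, so different skill vectors push the agent toward different directions of the latent space.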

Paper Structure (27 sections, 5 theorems, 22 equations, 14 figures, 2 tables, 1 algorithm)


Key Result

Theorem 4.1

Under some simplifying assumptions, linear squared METRA is equivalent to PCA under the temporal distance metric.
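For readers who want a concrete picture of the PCA side of this equivalence, here is a minimal sketch of ordinary PCA (top principal direction via power iteration) applied to a handful of 2-D points. The points are hypothetical and stand in for states already embedded so that Euclidean distance reflects temporal distance; the code illustrates standard PCA only, not the METRA objective or the theorem's proof.

```python
import math

def top_principal_direction(points, iters=200):
    """Return the unit vector of largest variance using power iteration."""
    n, d = len(points), len(points[0])
    mean = [sum(p[j] for p in points) / n for j in range(d)]
    centered = [[p[j] - mean[j] for j in range(d)] for p in points]
    # Covariance matrix of the centered points.
    cov = [[sum(c[i] * c[j] for c in centered) / n for j in range(d)]
           for i in range(d)]
    v = [1.0 / math.sqrt(d)] * d  # start from a generic unit vector
    for _ in range(iters):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        v = [x / norm for x in w]
    return v

# Hypothetical embedded states: widely spread along the first axis (think
# "distance travelled") and barely spread along the second ("pose detail").
points = [(x, y) for x in (-10.0, -5.0, 0.0, 5.0, 10.0)
          for y in (-1.0, 0.0, 1.0)]

direction = top_principal_direction(points)
```

In this toy setup the recovered direction aligns with the first axis, which matches the intuition that the most "temporally spread-out" dimension is kept and the low-variance one is dropped.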

Figures (14)

  • Figure 1: Illustration of METRA. Our main idea for scalable unsupervised RL is to cover only the most "important" low-dimensional subset of the state space, analogously to PCA. Specifically, METRA covers the most "temporally spread-out" (non-linear) manifold, which would lead to approximate coverage of the state space $\mathcal{S}$. In the example above, the two-dimensional $\mathcal{Z}$ space captures behaviors running in all directions, not necessarily covering every possible leg pose.
  • Figure 2: Sketch comparing different unsupervised RL objectives. Pure exploration approaches try to cover every possible state, which is infeasible in complex environments (e.g., such methods might get "stuck" forever finding novel joint angle configurations of a robot, without fully exploring the environment; see \ref{fig:qual}). The mutual information $I(S; Z)$ has no underlying distance metric, and thus does not prioritize coverage enough, focusing only on skills that are discriminable. In contrast, our proposed Wasserstein dependency measure $I_{\mathcal{W}}(S; Z)$ maximizes the distance metric $d$, which we choose to be the temporal distance, forcing the learned skills to span the "longest" subspaces of the state space, analogously to (temporal, non-linear) PCA.
  • Figure 3: Examples of behaviors learned by 11 unsupervised RL methods. For locomotion environments, we plot the $x$-$y$ (or $x$) trajectories sampled from learned policies. For Kitchen, we measure the coincidental success rates for six predefined tasks. Different colors represent different skills $z$. METRA is the only method that discovers diverse locomotion skills in pixel-based Quadruped and Humanoid. We refer to \ref{fig:qual_all} for the complete qualitative results (8 seeds) of METRA and https://seohong.me/projects/metra/ for videos.
  • Figure 4: Benchmark environments.
  • Figure 5: Quantitative comparison with unsupervised skill discovery methods (8 seeds). We measure the state/task coverage of the policies learned by five skill discovery methods. METRA exhibits the best coverage across all environments, while previous methods completely fail to explore the state spaces of pixel-based locomotion environments. Notably, METRA is the only method that discovers locomotion skills in pixel-based Quadruped and Humanoid.
  • ...and 9 more figures

Theorems & Definitions (10)

  • Theorem 4.1: Informal statement of Theorem C.2
  • Lemma B.1
  • proof
  • Theorem B.2: $\phi(x)^\top \psi(y)$ is a universal approximator of $f(x, y)$
  • proof
  • Theorem B.3
  • proof
  • Definition C.1: Temporally consistent embedding
  • Theorem C.2: Linear squared METRA is PCA in the temporal embedding space
  • proof