METRA: Scalable Unsupervised RL with Metric-Aware Abstraction

Seohong Park; Oleh Rybkin; Sergey Levine

TL;DR

METRA discovers a variety of useful behaviors even in complex, pixel-based environments, and is the first unsupervised RL method to discover diverse locomotion behaviors in pixel-based Quadruped and Humanoid environments.

Abstract

Unsupervised pre-training strategies have proven to be highly effective in natural language processing and computer vision. Likewise, unsupervised reinforcement learning (RL) holds the promise of discovering a variety of potentially useful behaviors that can accelerate the learning of a wide array of downstream tasks. Previous unsupervised RL approaches have mainly focused on pure exploration and mutual information skill learning. However, despite these attempts, making unsupervised RL truly scalable remains a major open challenge: pure exploration approaches might struggle in complex environments with large state spaces, where covering every possible transition is infeasible, and mutual information skill learning approaches might completely fail to explore the environment due to the lack of incentives. To make unsupervised RL scalable to complex, high-dimensional environments, we propose a novel unsupervised RL objective, which we call Metric-Aware Abstraction (METRA). Our main idea is, instead of directly covering the entire state space, to cover only a compact latent space ${\mathcal{Z}}$ that is metrically connected to the state space ${\mathcal{S}}$ by temporal distances. By learning to move in every direction in the latent space, METRA obtains a tractable set of diverse behaviors that approximately cover the state space, which makes it scalable to high-dimensional environments. Through experiments in five locomotion and manipulation environments, we demonstrate that METRA can discover a variety of useful behaviors even in complex, pixel-based environments, making it the first unsupervised RL method to discover diverse locomotion behaviors in pixel-based Quadruped and Humanoid. Our code and videos are available at https://seohong.me/projects/metra/

Paper Structure (27 sections, 5 theorems, 22 equations, 14 figures, 2 tables, 1 algorithm)

Introduction
Why Might Previous Unsupervised RL Methods Fail To Scale?
Preliminaries and Problem Setting
A Scalable Objective for Unsupervised RL
Tractable Optimization
Full Objective: Metric-Aware Abstraction (METRA)
Experiments
Experimental Setup
Qualitative Comparison
Quantitative Comparison
Conclusion
Extended Related Work
Theoretical results
Universality of Inner Product Decomposition
Lipschitz Constraint under the Temporal Distance Metric
...and 12 more sections

Key Result

Theorem 4.1

Under some simplifying assumptions, linear squared METRA is equivalent to PCA under the temporal distance metric.
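
To make the PCA analogy concrete, the following is a minimal NumPy sketch of ordinary linear PCA on synthetic data. It is not the METRA algorithm and involves no temporal distances or learned policies; it only shows what "covering the directions of largest spread" means in the Euclidean case. The function name `top_principal_directions` and the toy data are illustrative assumptions, not taken from the paper.

```python
import numpy as np


def top_principal_directions(x, k):
    """Return the k directions of largest variance in x (rows are samples)."""
    centered = x - x.mean(axis=0, keepdims=True)
    # Sample covariance matrix of the centered data.
    cov = centered.T @ centered / (len(x) - 1)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    eigvals, eigvecs = np.linalg.eigh(cov)
    order = np.argsort(eigvals)[::-1][:k]
    return eigvecs[:, order], eigvals[order]


rng = np.random.default_rng(0)
# Toy data (assumption): 3-D points spread widely along axis 0,
# moderately along axis 1, and barely along axis 2.
data = rng.normal(size=(2000, 3)) * np.array([10.0, 1.0, 0.1])

directions, variances = top_principal_directions(data, k=2)
# Project onto the two most spread-out directions, dropping the third.
projected = (data - data.mean(axis=0)) @ directions
```

Here the two retained directions play the role of a compact latent space: they keep the axes along which the data is most spread out and discard the rest. According to Theorem 4.1, linear squared METRA does the analogous thing under some simplifying assumptions, with spread measured by the temporal distance metric rather than Euclidean variance.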

Figures (14)

Figure 1: Illustration of METRA. Our main idea for scalable unsupervised RL is to cover only the most "important" low-dimensional subset of the state space, analogously to PCA. Specifically, METRA covers the most "temporally spread-out" (non-linear) manifold, which would lead to approximate coverage of the state space ${\mathcal{S}}$. In the example above, the two-dimensional ${\mathcal{Z}}$ space captures behaviors running in all directions, not necessarily covering every possible leg pose.
Figure 2: Sketch comparing different unsupervised RL objectives. Pure exploration approaches try to cover every possible state, which is infeasible in complex environments (e.g., such methods might get "stuck" forever finding novel joint angle configurations of a robot, without fully exploring the environment; see Figure 3). The mutual information $I(S; Z)$ has no underlying distance metric, and thus does not prioritize coverage enough, focusing only on skills that are discriminable. In contrast, our proposed Wasserstein dependency measure $I_{\mathcal{W}}(S; Z)$ maximizes distances under the metric $d$, which we choose to be the temporal distance, forcing the learned skills to span the "longest" subspaces of the state space, analogously to (temporal, non-linear) PCA.
Figure 3: Examples of behaviors learned by 11 unsupervised RL methods. For locomotion environments, we plot the $x$-$y$ (or $x$) trajectories sampled from learned policies. For Kitchen, we measure the coincidental success rates for six predefined tasks. Different colors represent different skills $z$. METRA is the only method that discovers diverse locomotion skills in pixel-based Quadruped and Humanoid. We refer to \ref{fig:qual_all} for the complete qualitative results ($8$ seeds) of METRA and https://seohong.me/projects/metra/ for videos.
Figure 4: Benchmark environments.
Figure 5: Quantitative comparison with unsupervised skill discovery methods ($\mathbf{8}$ seeds). We measure the state/task coverage of the policies learned by five skill discovery methods. METRA exhibits the best coverage across all environments, while previous methods completely fail to explore the state spaces of pixel-based locomotion environments. Notably, METRA is the only method that discovers locomotion skills in pixel-based Quadruped and Humanoid.
...and 9 more figures

Theorems & Definitions (10)

Theorem 4.1: Informal statement of Theorem C.2
Lemma B.1
proof
Theorem B.2: $\phi(x)^\top \psi(y)$ is a universal approximator of $f(x, y)$
proof
Theorem B.3
proof
Definition C.1: Temporally consistent embedding
Theorem C.2: Linear squared METRA is PCA in the temporal embedding space
proof
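
The inner product form named in Theorem B.2 can be checked numerically on a finite grid. There the table $F_{ij} = f(x_i, y_j)$ is just a matrix, and its singular value decomposition yields vectors $\phi(x_i)$ and $\psi(y_j)$ whose inner products reproduce it. The sketch below is a finite-dimensional illustration only, not the theorem's proof; the test function `f`, the grid sizes, and the helper name `inner_product_factors` are assumptions chosen for the example.

```python
import numpy as np


def inner_product_factors(table, rank):
    """Factor table[i, j] ~= phi[i] @ psi[j] using a truncated SVD."""
    u, s, vt = np.linalg.svd(table, full_matrices=False)
    scale = np.sqrt(s[:rank])
    phi = u[:, :rank] * scale      # one row per x grid point
    psi = vt[:rank, :].T * scale   # one row per y grid point
    return phi, psi


# Hypothetical test function sampled on a 50 x 40 grid (assumption).
xs = np.linspace(-1.0, 1.0, 50)
ys = np.linspace(-1.0, 1.0, 40)
x_grid, y_grid = xs[:, None], ys[None, :]
table = np.sin(3.0 * x_grid * y_grid) + np.abs(x_grid - y_grid)

# With rank equal to min(50, 40), the factorization is exact up to round-off.
phi_full, psi_full = inner_product_factors(table, rank=40)
full_error = np.max(np.abs(phi_full @ psi_full.T - table))

# A smaller latent dimension gives only an approximation.
phi_low, psi_low = inner_product_factors(table, rank=5)
low_error = np.max(np.abs(phi_low @ psi_low.T - table))
```

Comparing `full_error` with `low_error` shows the trade-off between latent dimension and approximation quality on this grid.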
