Latent-Predictive Empowerment: Measuring Empowerment without a Simulator

Andrew Levy, Alessandro Allievi, George Konidaris

TL;DR

LPE learns large skillsets by maximizing an objective that is a principled replacement for the mutual information between skills and states and that only requires a simpler latent-predictive model rather than a full simulator of the environment.

Abstract

Empowerment has the potential to help agents learn large skillsets, but it is not yet a scalable solution for training general-purpose agents. Recent empowerment methods learn diverse skillsets by maximizing the mutual information between skills and states; however, these approaches require a model of the transition dynamics, which can be challenging to learn in realistic settings with high-dimensional and stochastic observations. We present Latent-Predictive Empowerment (LPE), an algorithm that can compute empowerment in a more practical manner. LPE learns large skillsets by maximizing an objective that is a principled replacement for the mutual information between skills and states and that requires only a simpler latent-predictive model rather than a full simulator of the environment. We show empirically in a variety of settings, including ones with high-dimensional observations and highly stochastic transition dynamics, that our empowerment objective (i) learns skillsets similar in size to those of the leading empowerment algorithm that assumes access to a model of the transition dynamics and (ii) outperforms other model-based approaches to empowerment.

Paper Structure

This paper contains 29 sections, 15 equations, 18 figures, 1 table, 1 algorithm.

Figures (18)

  • Figure 1: (Left) Illustration of the latent-predictive model and state encoding distributions for both diverse and redundant skillsets. The different colored circles represent different tuples of skills (shown in $\mathcal{Z}$ box), open loop action sequences (shown in $\mathcal{A}$ box), skill-terminating states (shown in $\mathcal{S}_n$ box), and skill-terminating latent representations (shown in $\mathcal{Z}_n$ box) generated by a skillset. For a diverse skillset in which different skills target different states, the latent-predictive model (teal arrows), which maps actions to latent states, can output unique latent states that match the output of the state encoding distribution (purple arrows), which maps skill-terminating states to latent vectors. This produces a high overall diversity score for two reasons: the mutual information between skills and latent states, $I(Z;Z_n)$, is high because different skills target different latent states, and the KL divergence between the latent-predictive model and the state encoding distribution is low. On the other hand, for redundant skillsets in which different skills target the same states, the latent-predictive model may map different actions to the same latent vector, yielding a low overall diversity score because $I(Z;Z_n)$ is low. (Right) Comparison of the data required to optimize (i) $I(Z;S_n)$, the mutual information between skills and states, and (ii) our objective. For each candidate skillset $\pi_i$ (left column), $I(Z;S_n)$ may require $T$ tuples of (skill $z$, skill-ending state $s_n$), which in practice requires access to a simulator of the environment. On the other hand, most of the required data for our objective consists of the (skill $z$, action sequence $a$, latent representation $z_n$) tuples needed to estimate $I(Z;Z_n)$ for all candidate skillsets, which only requires learning a latent-predictive model.
  • Figure 2: Sample skill sequences in the pick-and-place versions of the Stochastic Four Rooms and RGB QR Code domains. In the top row, the blue circle agent executes a skill to move away from the red triangle object. In the bottom row, the black square agent carries the yellow object to the bottom of the room.
  • Figure 3: Illustration of the uniform distribution over skills $\phi$ used by Skillset Empowerment and our approach. The uniform distribution takes the shape of a $d$-dimensional cube centered at the origin with side length $\phi$. For instance, if the dimensionality of the skill space is 2 (i.e., $d=2$) as in the figure, skills $z \sim \phi(z|s_0)$ are uniformly sampled from a square centered at the origin with side length $\phi$.
  • Figure 4: Illustration of how the parameter-specific critics, $Q_{\omega_i}$ for $i=0,\dots,|\pi|-1$, attach to the actor $f_{\lambda}$ in order to determine the gradients of the actor. For each parameter $i$ in $\pi$, a critic $Q_{\omega_i}$ approximates how the diversity of the skill-conditioned policy changes with small changes to the $i$-th parameter of $\pi$. To obtain gradients showing how the diversity of a skill-conditioned policy changes with respect to $\lambda$, gradients are thus passed through each of the parameter-specific critics.
  • Figure 5: Illustration of trained latent-predictive, state encoding, and variational posterior distributions for a diverse skillset. As illustrated, the latent-predictive models (black arrows) output $z_n$ that (i) match the output of the state encoding distribution (pink arrows) and (ii) are unique and can be decoded back to the original skill via the variational posterior (blue arrows).
  • ...and 13 more figures
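
The uniform skill distribution described in the Figure 3 caption can be made concrete with a minimal sketch. It assumes NumPy; the function names `sample_skills` and `skill_entropy` are hypothetical and do not come from the paper. The sketch samples skills from a $d$-dimensional cube of side length $\phi$ centered at the origin and computes the differential entropy of that distribution, which for a uniform distribution is the log of the cube's volume, $d \log \phi$. How $\phi$ enters the LPE objective is not stated in the captions above, so the code illustrates only the sampling step.

```python
import numpy as np


def sample_skills(phi, d, n, rng=None):
    """Draw n skills uniformly from a d-dimensional cube of side length phi
    centered at the origin (hypothetical helper, not the authors' code)."""
    rng = np.random.default_rng() if rng is None else rng
    # Each coordinate is uniform on [-phi/2, phi/2).
    return rng.uniform(-phi / 2.0, phi / 2.0, size=(n, d))


def skill_entropy(phi, d):
    """Differential entropy in nats of the uniform distribution over the cube:
    the log of the cube volume phi**d, i.e. d * log(phi)."""
    return d * np.log(phi)


rng = np.random.default_rng(0)
z = sample_skills(phi=2.0, d=2, n=1000, rng=rng)  # the d=2 case shown in Figure 3
print(z.shape)
print(skill_entropy(phi=2.0, d=2))
```

Under this reading, increasing the side length $\phi$ raises the entropy of the skill distribution, which is one way to interpret a larger skillset.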