Constrained Skill Discovery: Quadruped Locomotion with Unsupervised Reinforcement Learning

Vassil Atanassov, Wanming Yu, Alexander Luis Mitchell, Mark Nicholas Finean, Ioannis Havoutis

TL;DR

This work uses unsupervised reinforcement learning to learn a latent representation by maximizing the mutual information between skills and states subject to a distance constraint, and improves upon prior constrained skill discovery methods by replacing the latent transition maximization with a norm-matching objective.

Abstract

Representation learning and unsupervised skill discovery can allow robots to acquire diverse and reusable behaviors without the need for task-specific rewards. In this work, we use unsupervised reinforcement learning to learn a latent representation by maximizing the mutual information between skills and states subject to a distance constraint. Our method improves upon prior constrained skill discovery methods by replacing the latent transition maximization with a norm-matching objective. This not only results in a much richer state space coverage compared to baseline methods, but also allows the robot to learn more stable and easily controllable locomotive behaviors. We successfully deploy the learned policy on a real ANYmal quadruped robot and demonstrate that the robot can accurately reach arbitrary points of the Cartesian state space in a zero-shot manner, using only an intrinsic skill discovery reward and standard regularization rewards.
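To make the difference between the two objectives concrete, the sketch below contrasts a transition-maximization reward (the inner product of the latent transition with the skill, as used in LSD- and METRA-style methods) with a skill-matching loss. It is only an illustration under stated assumptions: the mean-squared-error form follows the Figure 2 caption, while the function names, the toy two-dimensional latents, and the `exp(-loss)` reward shaping are hypothetical and not taken from the paper.

```python
import math


def latent_transition(phi_s, phi_s_next):
    """Latent displacement phi(s') - phi(s) for a single state transition."""
    return [b - a for a, b in zip(phi_s, phi_s_next)]


def transition_maximization_reward(phi_s, phi_s_next, z):
    """Baseline-style reward: inner product of the latent transition with z.

    Grows without bound as the latent step grows along z, whatever |z| is.
    """
    delta = latent_transition(phi_s, phi_s_next)
    return sum(d * zi for d, zi in zip(delta, z))


def skill_matching_loss(phi_s, phi_s_next, z):
    """Mean squared error between the latent transition and the sampled skill."""
    delta = latent_transition(phi_s, phi_s_next)
    return sum((d - zi) ** 2 for d, zi in zip(delta, z)) / len(z)


def intrinsic_reward(phi_s, phi_s_next, z):
    """Hypothetical shaping (assumption): reward is 1 when the loss is zero."""
    return math.exp(-skill_matching_loss(phi_s, phi_s_next, z))


# A low-magnitude skill, with one matching and one oversized latent step.
z_small = [0.1, 0.0]
origin = [0.0, 0.0]
matching_step = [0.1, 0.0]
oversized_step = [1.0, 0.0]

# The inner-product reward prefers the oversized step ...
prefers_big = (
    transition_maximization_reward(origin, oversized_step, z_small)
    > transition_maximization_reward(origin, matching_step, z_small)
)
# ... while the matching objective prefers the step whose size equals |z|.
prefers_match = (
    intrinsic_reward(origin, matching_step, z_small)
    > intrinsic_reward(origin, oversized_step, z_small)
)
print(prefers_big, prefers_match)
```

Under these assumptions, a small skill is best served by a correspondingly small latent step, which is one way to read the claim that skill magnitude controls the resulting behavior.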

Paper Structure

This paper contains 18 sections, 7 equations, 12 figures, 3 tables.

Figures (12)

  • Figure 1: We learn to map skill-conditioned state transitions to a latent space $\mathbb{Z}^2$ through an encoder $\phi(\cdot)$. By training a skill-conditioned policy together with the encoder, we learn a large range of locomotive skills on quadruped robots. Prior methods [park_lipschitz-constrained_2021, park_metra_2023] always maximize the latent transitions, so they learn only less stable, high-velocity motions regardless of the skill magnitude. In contrast, our proposed method learns a wider distribution of behaviors, which we can control by varying the magnitude of the sampled skills.
  • Figure 2: Learning scheme for our proposed approach. The encoder $\phi(\cdot)$ maps state transitions into a latent space optimized to match the skills (sampled from a predefined distribution $p(z)$), as shown by the MSE loss. The agent receives an intrinsic reward based on the loss magnitude, together with an extrinsic reward from the environment that encourages smooth behaviors.
  • Figure 3: Density distribution of the mean (across the episode) base velocity of 1000 trajectories with uniformly sampled skills, grouped into equally spaced bins in the range 0 to 3 m/s. We show the results for the baselines LSD (in orange) and METRA (in blue), and for ours (in green). A broader distribution indicates a larger skill space.
  • Figure 4: Comparison of XY base position trajectories (in meters) between our method, METRA, LSD, ASE, CASSI and DOMiNiC. To better illustrate the magnitude of the performance differences, we show the results with a fixed x- and y-axis scale across all algorithms. The colors indicate different skills (whether discrete or continuous). For our method, the first two plots show skills sampled at the maximum magnitude and at varying magnitudes, respectively.
  • Figure 5: After training, we can plan in the learned latent representation to condition the policy to reach desired states. We encode the current state ${\mathbf{s}}$ and desired state ${\mathbf{s}}_{\mathrm{des}}$ into the latent space, and use that as the conditioning skill for the policy.
  • ...and 7 more figures
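
The latent-space planning described in the Figure 5 caption can be sketched as follows. This is one plausible reading rather than the paper's implementation: it assumes the conditioning skill is the difference between the encoded desired state and the encoded current state, clipped to a maximum skill magnitude. The names `toy_encoder`, `goal_skill`, and `max_norm`, and the scaled XY encoder itself, are hypothetical stand-ins for the learned encoder $\phi(\cdot)$.

```python
import math


def toy_encoder(state):
    """Stand-in for the learned encoder phi (assumption): scaled base XY position."""
    x, y = state[0], state[1]
    return [0.5 * x, 0.5 * y]


def goal_skill(encoder, s, s_des, max_norm=1.0):
    """Skill pointing from the current state to the desired state in latent space.

    Assumes z = phi(s_des) - phi(s), rescaled so that |z| never exceeds max_norm.
    """
    z = [goal - cur for cur, goal in zip(encoder(s), encoder(s_des))]
    norm = math.sqrt(sum(zi * zi for zi in z))
    if norm > max_norm:
        z = [zi * max_norm / norm for zi in z]
    return z


# Nearby goal: the latent difference is used directly.
near = goal_skill(toy_encoder, [0.0, 0.0], [1.0, 0.0])
# Distant goal: the direction is kept, the magnitude is clipped to max_norm.
far = goal_skill(toy_encoder, [0.0, 0.0], [6.0, 8.0])
print(near, far)
```

In a closed-loop setting the skill would presumably be recomputed from the latest state at each control step, so its magnitude shrinks as the robot approaches the target.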