Unsupervised Visuomotor Control through Distributional Planning Networks

Tianhe Yu, Gleb Shevchuk, Dorsa Sadigh, Chelsea Finn

TL;DR

The paper addresses reinforcement learning for vision-based robots without hand-crafted rewards by learning a control-centric metric from unlabeled interaction. It introduces Distributional Planning Networks (DPN), an extension of Universal Planning Networks that models distributions over action sequences via latent variables and is trained with amortized variational inference, yielding a goal metric usable for RL. After training, the encoder alone provides a latent space in which progress toward a goal image can be measured, and soft actor-critic (SAC) is used to learn policies guided by this learned reward. Across multiple simulated manipulation tasks and real-world robot experiments, DPN outperforms inverse-model, VAE, and pixel-based baselines, enabling autonomous reaching, pushing, and deformable-object manipulation without explicit rewards.

Abstract

While reinforcement learning (RL) has the potential to enable robots to autonomously acquire a wide range of skills, in practice, RL usually requires manual, per-task engineering of reward functions, especially in real world settings where aspects of the environment needed to compute progress are not directly accessible. To enable robots to autonomously learn skills, we instead consider the problem of reinforcement learning without access to rewards. We aim to learn an unsupervised embedding space under which the robot can measure progress towards a goal for itself. Our approach explicitly optimizes for a metric space under which action sequences that reach a particular state are optimal when the goal is the final state reached. This enables learning effective and control-centric representations that lead to more autonomous reinforcement learning algorithms. Our experiments on three simulated environments and two real-world manipulation problems show that our method can learn effective goal metrics from unlabeled interaction, and use the learned goal metrics for autonomous reinforcement learning.
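
To make the idea of "measuring progress towards a goal" in an embedding space concrete, here is a minimal sketch of a reward defined as the negative distance between the embedded observation and the embedded goal image. It is an illustration only, not the authors' implementation: the Euclidean distance, the function names, and `toy_encoder` (a hypothetical stand-in for a trained encoder) are all assumptions.

```python
import math
from typing import Callable, List, Sequence

Vector = Sequence[float]


def latent_distance(x: Vector, x_goal: Vector) -> float:
    """Euclidean distance between two embeddings (an assumed choice of metric)."""
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, x_goal)))


def make_reward_fn(
    encode: Callable[[Vector], List[float]], goal_image: Vector
) -> Callable[[Vector], float]:
    """Build a reward equal to the negative latent distance to the goal."""
    x_goal = encode(goal_image)  # embed the goal image once

    def reward(observation: Vector) -> float:
        return -latent_distance(encode(observation), x_goal)

    return reward


def toy_encoder(image: Vector) -> List[float]:
    """Hypothetical stand-in for a trained encoder: averages adjacent pixel pairs."""
    return [(image[i] + image[i + 1]) / 2.0 for i in range(0, len(image) - 1, 2)]


# Flattened 4-pixel "images", purely for illustration.
goal = [1.0, 1.0, 0.0, 0.0]
reward_fn = make_reward_fn(toy_encoder, goal)

far = reward_fn([0.0, 0.0, 1.0, 1.0])   # observation unlike the goal
near = reward_fn([0.9, 0.9, 0.1, 0.1])  # observation close to the goal
at_goal = reward_fn(goal)               # observation equal to the goal
```

In this sketch the reward increases as the observation approaches the goal and reaches zero at the goal, which is the property an RL algorithm such as SAC would need from a learned goal metric.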

Paper Structure

This paper contains 18 sections, 10 equations, 8 figures, 1 algorithm.

Figures (8)

  • Figure 1: General overview of our method. Our method, DPN, enables autonomous reinforcement learning, without human-provided reward functions, on vision-based manipulation problems.
  • Figure 2: Diagram of our distributional planning networks model. Our model enables learning a representation $\mathbf{x}$ that induces a control-centric goal metric on images $\mathbf{o}$ from unlabeled interaction data. It does so by explicitly training for a metric under which gradient-based planning leads to a sequence of actions that reaches the final image. To effectively model the many action sequences that might lead to a goal after $T$ timesteps, we introduce latent variables $\mathbf{z}_{t:t+T}$ and train the model using amortized variational inference.
  • Figure 3: We conduct experiments on several different vision-based manipulation domains, including simulated rope manipulation, simulated pushing, robot reaching, and robot pushing in the real world.
  • Figure 4: Quantitative simulation results that evaluate the effectiveness of the goal metrics induced by each method by measuring the true distance to the goal state when running reinforcement learning with the reward derived from the learned goal metric. Performance is averaged across multiple tasks and error bars indicate standard error. Each RL step requires $20$ samples from the environment.
  • Figure 5: Comparisons of normalized latent distance to the goal determined by four approaches for the simulated rope manipulation task. We evaluate each latent metric on the trajectories (from a top-down view) of the RL policy with respect to DPN, inverse model, VAE, and pixel space, shown above from left to right. Note in the leftmost plot that, though the metric learned by the inverse model achieves a lower normalized latent distance than the DPN metric, it goes to around $0$ once the gripper moves closer to its corresponding position in the goal image without touching the rope, as shown in the second and fourth plots from the left. This suggests that the inverse model metric fails to capture the actual goal of the task, which is to bring the rope into the right shape.
  • ...and 3 more figures
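
The gradient-based planning mentioned in the Figure 2 caption can be illustrated with a toy example: start from an initial action sequence and repeatedly adjust it by gradient descent so that the predicted final latent state moves toward the goal embedding. The sketch below is a simplification under stated assumptions, not the paper's model. The additive dynamics in `rollout`, the squared-error loss, and the hyperparameters are hypothetical, and the latent variables $\mathbf{z}_{t:t+T}$ that DPN introduces are omitted.

```python
from typing import List, Tuple

Vector = List[float]


def rollout(x0: Vector, actions: List[Vector]) -> Vector:
    """Toy latent dynamics (assumed): each action is added to the latent state."""
    x = list(x0)
    for a in actions:
        x = [xi + ai for xi, ai in zip(x, a)]
    return x


def plan(
    x0: Vector, x_goal: Vector, horizon: int = 5, steps: int = 50, lr: float = 0.05
) -> Tuple[List[Vector], List[float]]:
    """Gradient descent on the action sequence to reduce ||x_final - x_goal||^2."""
    actions = [[0.0] * len(x0) for _ in range(horizon)]
    losses = []
    for _ in range(steps):
        x_final = rollout(x0, actions)
        error = [xf - xg for xf, xg in zip(x_final, x_goal)]
        losses.append(sum(e * e for e in error))
        # With additive dynamics, d(loss)/d(a_k) = 2 * error for every step k.
        grad = [2.0 * e for e in error]
        actions = [[ai - lr * gi for ai, gi in zip(a, grad)] for a in actions]
    return actions, losses


planned_actions, losses = plan(x0=[0.0, 0.0], x_goal=[1.0, -2.0])
final_state = rollout([0.0, 0.0], planned_actions)
```

In a learned model the dynamics would be a neural network and the gradient would come from automatic differentiation rather than the closed form used here. The toy version only shows the planning loop itself: the loss shrinks at every iteration and the planned actions carry the latent state to the goal.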