Path Integral Guided Policy Search

Yevgen Chebotar, Mrinal Kalakrishnan, Ali Yahya, Adrian Li, Stefan Schaal, Sergey Levine

TL;DR

This work extends guided policy search (GPS) by replacing the model-based local optimizer with a model-free path integral method (PI$^2$) that can handle discontinuous contact dynamics. It also trains on new task instances in every iteration by sampling from the global policy (on-policy sampling), which improves the generalization of visuomotor policies. The resulting deep neural network policies map camera images to torque commands and outperform LQR-based GPS on door opening and pick-and-place tasks, with strong gains in generalization when training on random task instances.

Abstract

We present a policy search method for learning complex feedback control policies that map from high-dimensional sensory inputs to motor torques, for manipulation tasks with discontinuous contact dynamics. We build on a prior technique called guided policy search (GPS), which iteratively optimizes a set of local policies for specific instances of a task, and uses these to train a complex, high-dimensional global policy that generalizes across task instances. We extend GPS in the following ways: (1) we propose the use of a model-free local optimizer based on path integral stochastic optimal control (PI$^2$), which enables us to learn local policies for tasks with highly discontinuous contact dynamics; and (2) we enable GPS to train on a new set of task instances in every iteration by using on-policy sampling: this increases the diversity of the instances that the policy is trained on, and is crucial for achieving good generalization. We show that these contributions enable us to learn deep neural network policies that can directly perform torque control from visual input. We validate the method on a challenging door opening task and a pick-and-place task, and we demonstrate that our approach substantially outperforms the prior LQR-based local policy optimizer on these tasks. Furthermore, we show that on-policy sampling significantly increases the generalization ability of these policies.
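
The abstract names a path integral optimizer as the model-free local update but does not spell it out. As a rough illustration only, the sketch below shows the general idea behind path-integral-style updates: sampled rollouts are reweighted by their exponentiated negative cost, and the new mean and covariance are the weighted statistics of the samples. The function names (`pi2_weights`, `pi2_update`), the temperature `eta`, and the use of a single parameter vector per sample (rather than time-varying controls with per-timestep cost-to-go) are assumptions made for this example, not the paper's implementation.

```python
import numpy as np


def pi2_weights(costs, eta=1.0):
    """Turn per-sample costs into probabilities proportional to exp(-cost / eta).

    `eta` is a hypothetical temperature: small values concentrate the weight
    on the lowest-cost samples, large values approach uniform weighting.
    """
    costs = np.asarray(costs, dtype=float)
    # Subtracting the minimum cost leaves the normalized weights unchanged
    # and avoids overflow or underflow in the exponential.
    logits = -(costs - costs.min()) / eta
    weights = np.exp(logits)
    return weights / weights.sum()


def pi2_update(samples, costs, eta=1.0):
    """Return the cost-weighted mean and covariance of the sampled parameters.

    samples: array of shape (num_samples, dim), one parameter vector per rollout.
    costs:   array of shape (num_samples,), total cost of each rollout.
    """
    samples = np.asarray(samples, dtype=float)
    weights = pi2_weights(costs, eta)
    mean = weights @ samples
    centered = samples - mean
    cov = (centered * weights[:, None]).T @ centered
    return mean, cov


# Toy usage: 10 sampled parameter vectors, with distance to a target as cost.
rng = np.random.default_rng(0)
target = np.array([1.0, -2.0])
samples = rng.normal(size=(10, 2))
costs = np.sum((samples - target) ** 2, axis=1)
new_mean, new_cov = pi2_update(samples, costs, eta=0.5)
```

Because the update only needs sampled costs, it does not require a dynamics model or cost gradients, which is the property the abstract relies on for discontinuous contact dynamics.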

Paper Structure

This paper contains 21 sections, 10 equations, 7 figures, 1 algorithm.

Figures (7)

  • Figure 1: Door opening and pick-and-place using our path integral guided policy search algorithm. Door opening can handle variability in the door pose, while the pick-and-place policy can handle various initial object poses.
  • Figure 2: The architecture of our neural network policy. The input RGB image is passed through a 3x3 convolution with stride 2 to generate 16 features at a lower resolution. The next 5 layers are 3x3 convolutions followed by 2x2 max-pooling, each of which outputs 32 features at successively reduced resolution and increased receptive field. The outputs of these 5 layers are recombined by passing each of them through a 1x1 convolution, upscaling the results to 125x157 with nearest-neighbor interpolation, and summing them (similar to Tompson et al., 2014). A final 1x1 convolution generates 32 feature maps. The spatial soft-argmax operator (Levine et al., 2016) computes the expected 2D image coordinates of each feature. To pre-train the vision layers, a fully connected layer computes the object and robot pose from these expected 2D feature coordinates. The feature points for the current image are concatenated with the feature points from the image at the first timestep and with the 33-dimensional robot state vector, then passed through two fully connected layers to produce the output joint torques.
  • Figure 3: Task setup and execution. Left: door opening task. Right: pick-and-place task. For both tasks, the pose of the object of interest (door or bottle) is randomized, and the robot must perform the task using monocular camera images from the camera mounted over the robot's shoulder.
  • Figure 4: Task adaptation success rates over the course of training with PI$^2$ and LQR for single instances of door opening and pick-and-place tasks. Each iteration consists of 10 trajectory samples.
  • Figure 5: Robot RGB camera images used for controlling the robot. Top: door opening task. Bottom: pick-and-place task.
  • ...and 2 more figures
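
The Figure 2 caption describes a spatial soft-argmax that turns each of the 32 feature maps into expected 2D image coordinates. A minimal NumPy sketch of that operator is given below. The function name `spatial_soft_argmax`, the `temperature` parameter, the normalization of coordinates to [-1, 1], and the reading of 125x157 as height x width are assumptions for illustration; the caption does not specify them.

```python
import numpy as np


def spatial_soft_argmax(features, temperature=1.0):
    """Map feature maps of shape (C, H, W) to expected (x, y) coordinates, shape (C, 2).

    Each map is converted into a probability distribution over pixels with a
    softmax, and the result is the expected pixel position under that
    distribution. Coordinates are normalized to [-1, 1] (an assumption here).
    """
    c, h, w = features.shape
    flat = features.reshape(c, h * w) / temperature
    # Subtract the per-map maximum for numerical stability of the softmax.
    flat = flat - flat.max(axis=1, keepdims=True)
    probs = np.exp(flat)
    probs /= probs.sum(axis=1, keepdims=True)
    probs = probs.reshape(c, h, w)

    xs = np.linspace(-1.0, 1.0, w)
    ys = np.linspace(-1.0, 1.0, h)
    # Marginalize over rows to get the x distribution, over columns for y.
    expected_x = (probs.sum(axis=1) * xs).sum(axis=1)
    expected_y = (probs.sum(axis=2) * ys).sum(axis=1)
    return np.stack([expected_x, expected_y], axis=1)


# Toy usage: 32 feature maps at the 125x157 resolution named in the caption.
rng = np.random.default_rng(0)
feature_maps = rng.normal(size=(32, 125, 157))
points = spatial_soft_argmax(feature_maps)  # shape (32, 2): one (x, y) per map
```

Flattening `points` gives 64 numbers per image, which is the kind of compact feature-point vector the caption describes concatenating with the first-timestep feature points and the robot state.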