Third-Person Visual Imitation Learning via Decoupled Hierarchical Controller

Pratyusha Sharma, Deepak Pathak, Abhinav Gupta

TL;DR

This work tackles third-person visual imitation learning from a single human demonstration without access to environment state. It proposes a decoupled hierarchical controller: a high-level goal generator translates third-person video into first-person sub-goals and a low-level inverse controller executes actions to achieve those goals, trained independently and run iteratively at test time. Applied to a real Baxter robot, the approach demonstrates improved generalization to unseen object positions, objects, and tasks compared with end-to-end and meta-learning baselines, as well as robust performance on pouring and placing tasks. The modular design enhances data efficiency and interpretability by enabling sub-goal visualization and shared low-level skills, with future work aimed at incorporating temporal structure and self-supervised data for further robustness.

Abstract

We study a generalized setup for learning from demonstration to build an agent that can manipulate novel objects in unseen scenarios by looking at only a single video of human demonstration from a third-person perspective. To accomplish this goal, our agent should not only learn to understand the intent of the demonstrated third-person video in its context but also perform the intended task in its environment configuration. Our central insight is to enforce this structure explicitly during learning by decoupling what to achieve (intended task) from how to perform it (controller). We propose a hierarchical setup where a high-level module learns to generate a series of first-person sub-goals conditioned on the third-person video demonstration, and a low-level controller predicts the actions to achieve those sub-goals. Our agent acts from raw image observations without any access to the full state information. We show results on a real robotic platform using Baxter for the manipulation tasks of pouring and placing objects in a box. Project video and code are at https://pathak22.github.io/hierarchical-imitation/

Paper Structure

This paper contains 21 sections, 1 equation, 4 figures, 4 tables.

Figures (4)

  • Figure 1: We study the general setup of learning from demonstration with the goal of building an agent that can imitate a single video of a human demonstration and generalize to novel objects and tasks. The figure shows an example of a third-person video demonstration on top and the robotic agent trying to imitate it with the objects in front of it. As shown on the right, our approach decouples the learning process into a hierarchy: a high-level "what" module that translates the third-person video into first-person sub-goals, and a low-level "how" module that achieves those sub-goals.
  • Figure 2: Decoupled Hierarchical Control for Third-Person Visual Imitation Learning: We introduce a hierarchical approach consisting of a goal generator that predicts a visual goal state, which the low-level controller then uses as guidance to achieve a task. [Left] During training, the decoupled models are trained independently. The goal generator takes as input the human video frames $h_t$ and $h_{t+k}$ along with the observed robot state $s_t$ to predict the visual goal state of the robot at $t+k$. The low-level controller is trained using $(s_t, a_t, s_{t+1})$ triplets. [Right] At inference, the models are executed one after the other in a loop. After reaching the current goal, the goal generator uses the new observed state $s_{t+1}$ and the next images of the human video to generate a new goal for the low-level controller to attain.
  • Figure 3: (a) The Goal Generator: The high-level goal generator network $\pi_H(\cdot)$ takes as input the frames of the human demonstration video $h_t, h_{t+k}$ and the current observed state of the robot $s_t$ at time $t$. It is trained to generate the visual representation $s_{t+k}$ of the robot at time $t+k$. Instead of solving the complex goal-image generation problem, our setup reduces it to a simpler re-rendering problem, i.e., moving the pixels of the robot image in a manner similar to the change in the human demonstration images. (b) Low-level Controller: The inputs to the low-level controller are the observed state of the robot $s_t$ and the goal state of the robot $s_{t+1}$. The model is trained to output the action $a_t$ that will cause it to transition from $s_t$ to the goal state.
  • Figure 4: (a) Goal Generator Comparison: The outputs predicted by the goal generator when optimized using different methods. Our model, which is trained to translate the robot's current image instead of generating one from scratch, produces the sharpest and most accurate results. (b) Sensitivity Analysis of the Goal Generator: Given the input human demonstration of a task, we test the sensitivity of the goal generator with respect to object locations. Our model can hallucinate accurate sub-goals in accordance with the object location. (c) Goal Generator Predictions: The images in the first row are the input observed robot states. The second row contains the goals generated by the goal generator from the input images. The predictions are at an interval of ten steps (approx. 2 sec) ahead into the future. As shown, the predicted sub-goals are consistent across the trajectory.
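
The inference loop described in the Figure 2 caption (generate a sub-goal, act toward it, observe the new state, repeat) can be sketched in a few lines of Python. This is a minimal illustration, not the authors' implementation: the function name `run_hierarchical_imitation`, the callables `goal_generator`, `inverse_controller`, and `step_env`, and the toy integer "states" are all hypothetical stand-ins for the learned networks and the robot. The sketch also assumes one low-level action per sub-goal, which the captions do not specify.

```python
from typing import Callable, List, Sequence


def run_hierarchical_imitation(
    human_frames: Sequence[int],
    initial_state: int,
    goal_generator: Callable[[int, int, int], int],
    inverse_controller: Callable[[int, int], int],
    step_env: Callable[[int, int], int],
    k: int = 10,
) -> List[int]:
    """Alternate between the high-level and low-level modules.

    All arguments are hypothetical placeholders: in the paper the states
    and frames are images and the two modules are learned networks.
    """
    state = initial_state
    visited = [state]
    # Walk through the human video in strides of k frames.
    for t in range(0, len(human_frames) - k, k):
        # High level ("what"): predict the robot's sub-goal from the
        # human frames h_t, h_{t+k} and the current robot observation s_t.
        goal = goal_generator(human_frames[t], human_frames[t + k], state)
        # Low level ("how"): infer an action that moves s_t toward the goal.
        action = inverse_controller(state, goal)
        # Execute the action and observe the new state (assumption: one
        # action per sub-goal; a real system may need several).
        state = step_env(state, action)
        visited.append(state)
    return visited


# Toy stand-ins so the loop can run without a robot or trained models.
def toy_goal_generator(h_now: int, h_next: int, s_now: int) -> int:
    # "Re-render" the robot state by applying the change seen in the video.
    return s_now + (h_next - h_now)


def toy_inverse_controller(s_now: int, s_goal: int) -> int:
    # The action is simply the displacement needed to reach the goal.
    return s_goal - s_now


def toy_step_env(s_now: int, action: int) -> int:
    # Deterministic toy dynamics: the action is applied exactly.
    return s_now + action


if __name__ == "__main__":
    frames = list(range(7))  # a 7-frame toy "human video"
    print(run_hierarchical_imitation(
        frames, 10, toy_goal_generator, toy_inverse_controller,
        toy_step_env, k=2,
    ))
```

Because the two modules only communicate through the sub-goal, either stand-in can be swapped out without touching the other, which mirrors the decoupling the captions describe.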