Planning from Observation and Interaction

Tyler Han; Siyang Shen; Rohan Baijal; Harine Ravichandiran; Bat Nemekhbold; Kevin Huang; Sanghun Jung; Byron Boots

TL;DR

This work presents a planning-based Inverse Reinforcement Learning (IRL) algorithm for world modeling from observation and interaction alone, and demonstrates that the learned world model representation is capable of online transfer learning in the real world from scratch.

Abstract

Observational learning requires an agent to learn to perform a task by referencing only observations of the performed task. This work investigates the equivalent setting in real-world robot learning, where access to hand-designed rewards and demonstrator actions is not assumed. To address this data-constrained setting, this work presents a planning-based Inverse Reinforcement Learning (IRL) algorithm for world modeling from observation and interaction alone. Experiments conducted entirely in the real world demonstrate that this paradigm is effective for learning image-based manipulation tasks from scratch in under an hour, without assuming prior knowledge, pre-training, or data of any kind beyond task observations. Moreover, this work demonstrates that the learned world model representation is capable of online transfer learning in the real world from scratch. In comparison to existing approaches, including IRL, RL, and Behavior Cloning (BC), which make more restrictive assumptions, the proposed approach demonstrates significantly greater sample efficiency and success rates, enabling a practical path forward for online world modeling and planning from observation and interaction. Videos and more at: https://uwrobotlearning.github.io/mpail2/.

Paper Structure (25 sections, 11 equations, 14 figures, 7 tables, 1 algorithm)

Introduction
Related Work
Model Predictive Adversarial Imitation Learning 2
Problem
Encoder & Dynamics
Inferred Reward
Value
Policy
Planner
Experiments
IRL Methods as Ablations
Reinforcement Learning with Prior Data (RLPD)
Behavior Cloning (Diffusion)
Simulated Sample-Efficiency Experiments
Real-World Sample-Efficiency Experiments
...and 10 more sections

Figures (14)

Figure 1: Overview of MPAIL2. (0) The learner observes a task demonstration and stores the observations before training. (1) The learner observes the world and encodes the observation into its current latent state, $z_0 = e(o)$. The learner's policy $\pi$ suggests reactive, possibly suboptimal actions (purple dotted lines). Before executing any action, the learner predicts and evaluates future latent states to derive a plan using the pictured component models. This process involves: (a) randomly sampling action sequences (i.e., plans) for evaluation (not pictured); (b) predicting the trajectories implied by the sampled plans using the dynamics model $z'=f(z,a)$ (dotted lines); and (c) predicting the total return of each plan by summing the rewards $r(z,z')$ of its implied state transitions and bootstrapping with the terminal value $Q(z,a)$. (2) An action $a$ is executed according to plans with higher returns; the world's response to the action, $o'$, is observed; and the interaction $(o,a,o')$ is added to the learner's experience. Learning occurs by updating all of the component models over the accumulated experience. Initial task observations are used only to update the reward model.
Figure 2: Top. Simulation of MPAIL2 on the Block Push (state) task at three time steps along an episode. Bottom. Predicted plans in the XY plane (top-down) for the next one second at each time step. End-effector plans are drawn in green. Block plans are drawn in orange. Block trajectories, static in the leftmost frame, become dynamic as the robot approaches and makes contact. These predictions show how the agent gradually learns to plan over physically causal relationships, like contact. Note: planning occurs in latent space. This visualization is made possible by a separately trained decoder.
Figure 3: Overview of evaluation tasks. We evaluate our method and baselines on four tasks: two in simulation and two in the real world. The Pick and Place tasks (a,c) involve reaching the block, grasping it, lifting it, and placing it beyond a target line. The Push tasks (b,d) involve reaching the block and pushing it beyond a target line.
Figure 4: Cumulative successes in simulated experiments. Offline results of BC are shown in \ref{tab:sim-eval-summary}.
Figure 5: Cumulative successes in real-world experiments. Offline results of BC are shown in \ref{tab:real-eval-summary}. Complete training time for one training run (including resets, computation, etc.) is approximately 90 minutes for Block Push and 70 minutes for Pick and Place.
...and 9 more figures
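
The planning step described in the Figure 1 caption (sample plans, roll them out with the dynamics model, score them with summed rewards plus a terminal value) can be illustrated with a minimal sketch. This is not the authors' implementation. Every function below is a hypothetical toy stand-in on a 1-D latent state: `encode` for $e(o)$, `dynamics` for $f(z,a)$, `reward` for $r(z,z')$, `policy` for $\pi$, and `q_value` for $Q(z,a)$. Two further choices are assumptions that the caption does not specify: plans are sampled as Gaussian noise around the policy's proposal, and the executed action is a softmax-weighted (MPPI-style) average of the sampled first actions.

```python
import math
import random

GOAL = 1.0  # hypothetical target latent state for this toy task


def encode(o):
    """Toy stand-in for the encoder z = e(o); identity here."""
    return float(o)


def dynamics(z, a):
    """Toy stand-in for the learned dynamics z' = f(z, a)."""
    return z + a


def reward(z, z_next):
    """Toy stand-in for the inferred reward r(z, z'): progress toward GOAL."""
    return abs(z - GOAL) - abs(z_next - GOAL)


def policy(z):
    """Toy stand-in for the policy pi: a clipped step toward GOAL."""
    return max(-0.2, min(0.2, GOAL - z))


def q_value(z, a):
    """Toy stand-in for the terminal value Q(z, a)."""
    return -abs(dynamics(z, a) - GOAL)


def plan(o, horizon=5, num_samples=64, noise=0.05, temperature=0.05, rng=None):
    """Return one action by sampling plans and weighting them by return."""
    rng = rng or random.Random(0)
    z0 = encode(o)
    first_actions, returns = [], []
    for _ in range(num_samples):
        z, total, first = z0, 0.0, None
        for t in range(horizon):
            # (a) sample an action near the policy's proposal (assumption)
            a = max(-0.2, min(0.2, policy(z) + rng.gauss(0.0, noise)))
            if t == 0:
                first = a
            # (b) predict the next latent state with the dynamics model
            z_next = dynamics(z, a)
            # (c) accumulate the reward of the implied transition
            total += reward(z, z_next)
            z = z_next
        # bootstrap with the terminal value at the end of the horizon
        total += q_value(z, policy(z))
        first_actions.append(first)
        returns.append(total)
    # Softmax weighting over returns (assumption): higher-return plans
    # contribute more to the executed action.
    best = max(returns)
    weights = [math.exp((ret - best) / temperature) for ret in returns]
    norm = sum(weights)
    return sum(w * a for w, a in zip(weights, first_actions)) / norm


if __name__ == "__main__":
    z = 0.0
    for step in range(10):
        a = plan(z)
        z = dynamics(z, a)  # here the toy "world" matches the toy model
        print(step, round(a, 3), round(z, 3))
```

In the full method the component models are learned from the accumulated interactions $(o,a,o')$, with the reward model additionally updated from the stored task observations; here they are fixed by hand so that only the plan-evaluation loop is shown. Subtracting the best return before exponentiating keeps the softmax weights numerically stable.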
