Visual Robot Task Planning

Chris Paxton, Yotam Barnoy, Kapil Katyal, Raman Arora, Gregory D. Hager

TL;DR

The paper addresses the challenge of robotic planning under uncertainty by learning a latent world representation and a neural transition model that can forward-simulate high-level actions. It combines an encoder–decoder mapping with a transform function to predict future observations, and integrates these components with Monte Carlo Tree Search to generate and evaluate action sequences in unseen environments. Across navigation, block stacking, and suturing datasets, the approach yields realistic prospective futures and feasible plans, while providing interpretable visualizations of intermediate goals. This visual task planning framework demonstrates how learned representations from visual data can drive planning and explainable decision making in robotics.

Abstract

Prospection, the act of predicting the consequences of many possible futures, is intrinsic to human planning and action, and may even be at the root of consciousness. Surprisingly, this idea has been explored comparatively little in robotics. In this work, we propose a neural network architecture and associated planning algorithm that (1) learns a representation of the world useful for generating prospective futures after the application of high-level actions, (2) uses this generative model to simulate the result of sequences of high-level actions in a variety of environments, and (3) uses this same representation to evaluate these actions and perform tree search to find a sequence of high-level actions in a new environment. Models are trained via imitation learning on a variety of domains, including navigation, pick-and-place, and a surgical robotics task. Our approach allows us to visualize intermediate motion goals and learn to plan complex activity from visual information.
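
The three steps in the abstract (encode the world, simulate high-level actions in the learned representation, evaluate and search) can be illustrated with a minimal sketch. Everything below is a toy stand-in, not the authors' implementation: `encode`, `transform`, and `value` are hypothetical hand-written functions in place of the learned networks, the explicit `goal` argument is an assumption made so the toy evaluation has something to score against, and an exhaustive depth-limited search replaces Monte Carlo Tree Search to keep the example short.

```python
from itertools import product
from typing import Dict, List, Sequence, Tuple

Latent = Tuple[float, ...]

# Hypothetical high-level actions and their effect on a 2-D latent state.
# In the paper the transition is a learned neural network, not a lookup table.
ACTION_EFFECTS: Dict[str, Tuple[float, float]] = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "up": (0.0, 1.0),
}


def encode(observation: Sequence[float]) -> Latent:
    """Toy stand-in for f_enc: the observation is already low-dimensional."""
    return tuple(float(v) for v in observation)


def transform(h: Latent, action: str) -> Latent:
    """Toy stand-in for T(h, a): shift the latent by an action-specific offset."""
    offset = ACTION_EFFECTS[action]
    return tuple(hv + d for hv, d in zip(h, offset))


def value(h: Latent, goal: Latent) -> float:
    """Toy evaluation (an assumption): negative squared distance to a goal latent."""
    return -sum((a - b) ** 2 for a, b in zip(h, goal))


def plan(observation: Sequence[float], goal: Latent, depth: int) -> Tuple[List[str], float]:
    """Exhaustive depth-limited search over action sequences in latent space.

    This is a simplification: it enumerates every sequence instead of
    sampling and expanding a tree as Monte Carlo Tree Search would.
    """
    h0 = encode(observation)
    best_score, best_plan = float("-inf"), []
    for sequence in product(sorted(ACTION_EFFECTS), repeat=depth):
        h = h0
        for action in sequence:
            h = transform(h, action)  # forward-simulate one high-level action
        score = value(h, goal)
        if score > best_score:
            best_score, best_plan = score, list(sequence)
    return best_plan, best_score


if __name__ == "__main__":
    actions, score = plan(observation=(0.0, 0.0), goal=(2.0, 1.0), depth=3)
    print(actions, score)
```

The point of the sketch is the structure: the search never touches raw observations after the first encoding, so every candidate future is simulated and scored entirely in the latent space.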

Paper Structure

This paper contains 16 sections, 3 equations, 12 figures, 2 tables, 1 algorithm.

Figures (12)

  • Figure 1: Example of our algorithm using learned policies to predict a good sequence of actions. Left: initial observation $x_0$ and current observation $x_i$, plus corresponding encodings $h_0$ and $h_i$. Right: predicted results of three sequential high-level actions.
  • Figure 2: Predicting the next step during a suturing task based on labeled surgical data. Predictions clearly show the next position of the arms.
  • Figure 3: Overview of the prediction network for visual task planning. We learn $f_{enc}(x)$, $f_{dec}(x)$, and $T(h, a)$ to be able to predict and visualize results of high-level actions.
  • Figure 4: Encoder-decoder architecture used for learning a transform into and out of the hidden space $h$.
  • Figure 5: Architecture of the transform function $T(h_0, h, a)$ for computing transformations to an action subgoal in the learned hidden space.
  • ...and 7 more figures
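
Read together, the captions for Figures 1, 3, and 5 suggest how a single predicted step is composed. The following is a sketch inferred from those captions, not an equation quoted from the paper; the hat notation for predicted quantities is an assumption introduced here.

$$h_0 = f_{enc}(x_0), \qquad h_i = f_{enc}(x_i), \qquad \hat{h}_{i+1} = T(h_0, h_i, a), \qquad \hat{x}_{i+1} = f_{dec}(\hat{h}_{i+1})$$

Under this reading, a sequence of high-level actions is simulated by applying $T$ repeatedly in the hidden space, and $f_{dec}$ is needed only to visualize the predicted result.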