Visual Robot Task Planning
Chris Paxton, Yotam Barnoy, Kapil Katyal, Raman Arora, Gregory D. Hager
TL;DR
The paper addresses the challenge of robotic planning under uncertainty by learning a latent world representation and a neural transition model that can forward-simulate high-level actions. It combines an encoder–decoder mapping with a transform function to predict future observations, and integrates these components with Monte Carlo Tree Search to generate and evaluate action sequences in unseen environments. Across navigation, block stacking, and suturing datasets, the approach yields realistic prospective futures and feasible plans, while providing interpretable visualizations of intermediate goals. This visual task planning framework demonstrates how learned representations from visual data can drive planning and explainable decision making in robotics.
Abstract
Prospection, the act of predicting the consequences of many possible futures, is intrinsic to human planning and action, and may even be at the root of consciousness. Surprisingly, this idea has been explored comparatively little in robotics. In this work, we propose a neural network architecture and associated planning algorithm that (1) learns a representation of the world useful for generating prospective futures after the application of high-level actions, (2) uses this generative model to simulate the result of sequences of high-level actions in a variety of environments, and (3) uses this same representation to evaluate these actions and perform tree search to find a sequence of high-level actions in a new environment. Models are trained via imitation learning on a variety of domains, including navigation, pick-and-place, and a surgical robotics task. Our approach allows us to visualize intermediate motion goals and learn to plan complex activity from visual information.
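The three components listed in the abstract (a learned representation, a generative model that simulates high-level actions, and a search that evaluates the simulated outcomes) can be illustrated with a minimal toy sketch. Everything in it is an assumption made for illustration and is not the authors' implementation: `encode`, `decode`, `transform`, `value`, and `plan` are hypothetical names, the "networks" are hand-written stand-ins on a 2-D grid, and a plain exhaustive depth-limited tree search is substituted for Monte Carlo Tree Search to keep the example short.

```python
from typing import Dict, List, Sequence, Tuple

Latent = Tuple[float, float]

# Hypothetical high-level actions; each is a fixed displacement in latent space.
ACTIONS: Dict[str, Latent] = {
    "left": (-1.0, 0.0),
    "right": (1.0, 0.0),
    "up": (0.0, 1.0),
    "down": (0.0, -1.0),
}


def encode(observation: Sequence[float]) -> Latent:
    """Stand-in for a learned encoder: observation -> latent state."""
    return (float(observation[0]), float(observation[1]))


def decode(latent: Latent) -> List[float]:
    """Stand-in for a learned decoder: latent state -> predicted observation."""
    return [latent[0], latent[1]]


def transform(latent: Latent, action: str) -> Latent:
    """Stand-in for a learned transition model applied in latent space."""
    dx, dy = ACTIONS[action]
    return (latent[0] + dx, latent[1] + dy)


def value(latent: Latent, goal: Latent) -> float:
    """Stand-in for a learned value estimate: negative Manhattan distance to goal."""
    return -(abs(latent[0] - goal[0]) + abs(latent[1] - goal[1]))


def plan(observation: Sequence[float], goal: Latent, depth: int):
    """Exhaustive depth-limited tree search over action sequences (not MCTS).

    Returns the best action sequence found and the decoded observation
    predicted after each step, which plays the role of a visualizable
    intermediate goal.
    """
    root = encode(observation)
    best_score = value(root, goal)
    best_actions: List[str] = []
    best_futures: List[List[float]] = [decode(root)]
    frontier = [(root, best_actions, best_futures)]
    for _ in range(depth):
        next_frontier = []
        for latent, actions, futures in frontier:
            for name in ACTIONS:
                nxt = transform(latent, name)  # forward-simulate one action
                node = (nxt, actions + [name], futures + [decode(nxt)])
                next_frontier.append(node)
                score = value(nxt, goal)
                if score > best_score:  # keep the first strictly better node
                    best_score, best_actions, best_futures = score, node[1], node[2]
        frontier = next_frontier
    return best_actions, best_futures


if __name__ == "__main__":
    actions, futures = plan(observation=(0, 0), goal=(2.0, 1.0), depth=3)
    print(actions)
    print(futures)
```

In this sketch the search never touches the environment: every candidate sequence is rolled out entirely through `transform`, and `decode` is called only to expose the predicted intermediate observations. In a learned system the three stand-in functions would be replaced by trained networks, and the exhaustive expansion by a sampling-based search.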
