Imitating Task and Motion Planning with Visuomotor Transformers

Murtaza Dalal; Ajay Mandlekar; Caelan Garrett; Ankur Handa; Ruslan Salakhutdinov; Dieter Fox

TL;DR

This work introduces OPTIMUS, a system that distills a privileged Task and Motion Planning (TAMP) expert into fast, perception-based visuomotor Transformer policies trained offline from image observations. By generating large, diverse TAMP supervision data and applying a multi-view vision backbone with a probabilistic, multi-modal output, OPTIMUS achieves robust long-horizon manipulation across many objects and tasks, outperforming baselines and doubling the practical execution speed relative to TAMP. The approach demonstrates strong performance on complex tasks (up to 8 stages) and shows promising generalization to unseen configurations, while highlighting remaining challenges in scaling to highly multi-object scenarios and real-world transfer. Overall, OPTIMUS offers a scalable path to high-frequency, end-to-end manipulation policies by leveraging planning-generated data without requiring state information at test time.

Abstract

Imitation learning is a powerful tool for training robot manipulation policies, allowing them to learn from expert demonstrations without manual programming or trial-and-error. However, common methods of data collection, such as human supervision, scale poorly, as they are time-consuming and labor-intensive. In contrast, Task and Motion Planning (TAMP) can autonomously generate large-scale datasets of diverse demonstrations. In this work, we show that the combination of large-scale datasets generated by TAMP supervisors and flexible Transformer models to fit them is a powerful paradigm for robot manipulation. To that end, we present a novel imitation learning system called OPTIMUS that trains large-scale visuomotor Transformer policies by imitating a TAMP agent. OPTIMUS introduces a pipeline for generating TAMP data that is specifically curated for imitation learning and can be used to train performant transformer-based policies. In this paper, we present a thorough study of the design decisions required to imitate TAMP and demonstrate that OPTIMUS can solve a wide variety of challenging vision-based manipulation tasks with over 70 different objects, ranging from long-horizon pick-and-place tasks, to shelf and articulated object manipulation, achieving 70 to 80% success rates. Video results and code at https://mihdalal.github.io/optimus/

Paper Structure (19 sections, 9 figures, 10 tables)

Introduction
Preliminaries
Designing a TAMP Imitation System
Cost-Minimizing TAMP
Generating Imitation Data from TAMP
Training Imitation Policies at Scale
Experimental Evaluation
Learning Results
Limitations and Future Work
Table of Contents
Additional Learning Results
Ablations
Environments
Agent Structure
Experiment Details
...and 4 more sections

Figures (9)

Figure 1: Long-horizon task visualization. We visualize the initial state and each intermediate pick state for the pick-and-place task. Note there is significant variation in geometry across each object, requiring the agent to perform a diverse series of grasps to complete the task.
Figure 2: OPTIMUS system. Column 1: We generate a variety of tasks with differing initial configurations (left) and goals (right). Column 2: We transform TAMP joint space demonstrations to task space (top), go from privileged scene knowledge in TAMP to visual observations (middle) and prune TAMP demonstrations based on workspace constraints. Columns 3 and 4: We perform large-scale behavior cloning using a Transformer-based architecture and execute the visuomotor policies.
Figure 3: OPTIMUS policy architecture. The model takes as input multiple images and proprioception information per time-step, with a context length of $h$. We encode the images using ResNet-18 and the low-dimensional observations using an MLP. We concatenate the embeddings, project them into the Transformer embedding dimension and pass them to the Transformer, which predicts an embedding that is decoded into an action.
Figure 4: Environment Visualizations. We evaluate OPTIMUS on long-horizon block stacking (a), multi-step pick-place (b), shelf object manipulation (c), and articulated object manipulation (d).
Figure 5: Long Horizon Manipulation Results. (left) Performance is shown in terms of task success rate. While all methods are able to solve single-step block stacking, only OPTIMUS is able to solve longer-horizon variants. (right) For long-horizon manipulation, while the baselines are competitive with OPTIMUS on PickPlaceTwo, OPTIMUS demonstrates significant improvement in success rate as the number of objects increases.
...and 4 more figures
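
The Figure 2 caption mentions pruning TAMP demonstrations based on workspace constraints. This summary does not state the exact criterion, so the following is only a minimal sketch under an assumption: each demonstration is a sequence of end-effector positions in task space, and a demonstration is kept only if every position lies inside an axis-aligned box. The names `Workspace` and `prune_demonstrations` and the numeric bounds are hypothetical and chosen for illustration; they are not taken from the OPTIMUS code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

# A 3D end-effector position (x, y, z) in task space.
Point = Tuple[float, float, float]


@dataclass(frozen=True)
class Workspace:
    """Axis-aligned box of allowed end-effector positions (hypothetical bounds)."""

    lower: Point
    upper: Point

    def contains(self, p: Point) -> bool:
        # A point is inside if every coordinate lies within [lower, upper].
        return all(lo <= x <= hi for lo, x, hi in zip(self.lower, p, self.upper))


def prune_demonstrations(
    demos: Sequence[Sequence[Point]], workspace: Workspace
) -> List[Sequence[Point]]:
    """Keep only demonstrations whose every position stays inside the workspace.

    Assumption: the constraint is a simple box check on positions. The real
    system may use a different criterion.
    """
    return [d for d in demos if all(workspace.contains(p) for p in d)]


# Usage: one trajectory stays inside the box, the other leaves it along x.
ws = Workspace(lower=(-0.5, -0.5, 0.0), upper=(0.5, 0.5, 1.0))
inside = [(0.0, 0.0, 0.2), (0.1, 0.2, 0.5)]
outside = [(0.0, 0.0, 0.2), (0.9, 0.0, 0.3)]
kept = prune_demonstrations([inside, outside], ws)
```

Filtering whole trajectories, rather than clipping individual positions, keeps each retained demonstration an unmodified expert rollout. That is one plausible reason to prune at the demonstration level before behavior cloning.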
