Imitating Task and Motion Planning with Visuomotor Transformers

Murtaza Dalal, Ajay Mandlekar, Caelan Garrett, Ankur Handa, Ruslan Salakhutdinov, Dieter Fox

TL;DR

This work introduces OPTIMUS, a system that distills a privileged Task and Motion Planning (TAMP) expert into fast, perception-based visuomotor Transformer policies trained offline from image observations. By generating large, diverse TAMP supervision data and applying a multi-view vision backbone with a probabilistic, multi-modal output, OPTIMUS achieves robust long-horizon manipulation across many objects and tasks, outperforming baselines and doubling the practical execution speed relative to TAMP. The approach demonstrates strong performance on complex tasks (up to 8 stages) and shows promising generalization to unseen configurations, while highlighting remaining challenges in scaling to highly multi-object scenarios and real-world transfer. Overall, OPTIMUS offers a scalable path to high-frequency, end-to-end manipulation policies by leveraging planning-generated data without requiring state information at test time.

Abstract

Imitation learning is a powerful tool for training robot manipulation policies, allowing them to learn from expert demonstrations without manual programming or trial-and-error. However, common methods of data collection, such as human supervision, scale poorly, as they are time-consuming and labor-intensive. In contrast, Task and Motion Planning (TAMP) can autonomously generate large-scale datasets of diverse demonstrations. In this work, we show that the combination of large-scale datasets generated by TAMP supervisors and flexible Transformer models to fit them is a powerful paradigm for robot manipulation. To that end, we present a novel imitation learning system called OPTIMUS that trains large-scale visuomotor Transformer policies by imitating a TAMP agent. OPTIMUS introduces a pipeline for generating TAMP data that is specifically curated for imitation learning and can be used to train performant transformer-based policies. In this paper, we present a thorough study of the design decisions required to imitate TAMP and demonstrate that OPTIMUS can solve a wide variety of challenging vision-based manipulation tasks with over 70 different objects, ranging from long-horizon pick-and-place tasks, to shelf and articulated object manipulation, achieving 70 to 80% success rates. Video results and code at https://mihdalal.github.io/optimus/

Paper Structure (19 sections, 9 figures, 10 tables)

Figures (9)

  • Figure 1: Long-horizon task visualization. We visualize the initial state and each intermediate pick state for the pick-and-place task. Note there is significant variation in geometry across each object, requiring the agent to perform a diverse series of grasps to complete the task.
  • Figure 2: OPTIMUS system. Column 1: We generate a variety of tasks with differing initial configurations (left) and goals (right). Column 2: We transform TAMP joint space demonstrations to task space (top), go from privileged scene knowledge in TAMP to visual observations (middle) and prune TAMP demonstrations based on workspace constraints. Columns 3 and 4: We perform large-scale behavior cloning using a Transformer-based architecture and execute the visuomotor policies.
  • Figure 3: OPTIMUS policy architecture. The model takes as input multiple images and proprioception information per time-step, with a context length of $h$. We encode the input using ResNet-18 for images and an MLP for the low-dimensional observations. We concatenate the embeddings, project them into the Transformer embedding dimension and pass them to the Transformer, which predicts an embedding that is decoded into an action.
  • Figure 4: Environment Visualizations. We evaluate OPTIMUS on long-horizon block stacking (a), multi-step pick-place (b), shelf object manipulation (c), and articulated object manipulation (d).
  • Figure 5: Long Horizon Manipulation Results. (left) Performance is shown in terms of task success rate. While all methods are able to solve single-step block stacking, only OPTIMUS is able to solve longer-horizon variants. (right) For long-horizon manipulation, while the baselines are competitive with OPTIMUS on PickPlaceTwo, OPTIMUS demonstrates significant improvement in success rate as the number of objects increases.
  • ...and 4 more figures
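
The Figure 2 caption mentions pruning TAMP demonstrations based on workspace constraints. The exact criterion is not given in this section, so the following is only a minimal sketch under stated assumptions: the workspace is taken to be an axis-aligned box in task space, and a demonstration is dropped if any end-effector position leaves that box. The names `Demo`, `Workspace`, and `prune_demos`, as well as the numeric bounds, are hypothetical and are not taken from the OPTIMUS code.

```python
from dataclasses import dataclass
from typing import List, Sequence, Tuple

Vec3 = Tuple[float, float, float]


@dataclass
class Demo:
    """One demonstration, reduced to its end-effector positions (task space)."""
    ee_positions: List[Vec3]


@dataclass
class Workspace:
    """Axis-aligned box; an assumed stand-in for the real workspace constraint."""
    lower: Vec3
    upper: Vec3

    def contains(self, p: Vec3) -> bool:
        # A point is inside if every coordinate lies within its bounds.
        return all(lo <= x <= hi for lo, x, hi in zip(self.lower, p, self.upper))


def prune_demos(demos: Sequence[Demo], ws: Workspace) -> List[Demo]:
    """Keep only demonstrations whose every end-effector position is inside ws."""
    return [d for d in demos if all(ws.contains(p) for p in d.ee_positions)]


# Illustrative bounds in meters (assumed, not from the paper).
ws = Workspace(lower=(-0.5, -0.5, 0.0), upper=(0.5, 0.5, 1.0))
demos = [
    Demo([(0.0, 0.0, 0.2), (0.1, 0.2, 0.4)]),  # stays inside the box
    Demo([(0.0, 0.0, 0.2), (0.9, 0.0, 0.3)]),  # second point has x > 0.5
]
kept = prune_demos(demos, ws)
```

Filtering whole trajectories, rather than clipping individual actions, keeps every retained demonstration a complete successful rollout, which is the form behavior cloning expects.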