Manipulate-Anything: Automating Real-World Robots using Vision-Language Models

Jiafei Duan, Wentao Yuan, Wilbert Pumacay, Yi Ru Wang, Kiana Ehsani, Dieter Fox, Ranjay Krishna

TL;DR

Robotics data collection is a major bottleneck for scalable learning. Manipulate-Anything leverages vision-language and language models to automatically generate diverse, zero-shot robot demonstrations in real-world settings without privileged state information or hand-crafted skills, enabling robust behavior cloning. Across simulation and real-world tasks, it outperforms baselines and produces training data that rivals or exceeds human demonstrations in several settings. The framework demonstrates scalable data generation and zero-shot task solving, with practical impact for rapid deployment of robotic manipulation policies.

Abstract

Large-scale endeavors and widespread community efforts such as Open-X-Embodiment have contributed to growing the scale of robot demonstration data. However, there is still an opportunity to improve the quality, quantity, and diversity of robot demonstration data. Although vision-language models have been shown to automatically generate demonstration data, their utility has been limited to environments with privileged state information, and they require hand-designed skills and can interact with only a few object instances. We propose Manipulate-Anything, a scalable automated generation method for real-world robotic manipulation. Unlike prior work, our method can operate in real-world environments without any privileged state information or hand-designed skills, and it can manipulate any static object. We evaluate our method using two setups. First, Manipulate-Anything successfully generates trajectories for all 7 real-world and 14 simulation tasks, significantly outperforming existing methods like VoxPoser. Second, Manipulate-Anything's demonstrations train more robust behavior cloning policies than human demonstrations or data generated by VoxPoser, Scaling-up, and Code-As-Policies. We believe Manipulate-Anything can be a scalable method both for generating robotics data and for solving novel tasks in a zero-shot setting. Project page: https://robot-ma.github.io/.

Paper Structure

This paper contains 14 sections, 6 figures, and 3 tables.

Figures (6)

  • Figure 1: Manipulate-Anything is an automated method for robot manipulation in real-world environments. Unlike prior methods, it does not require privileged state information or hand-designed skills, nor is it limited to manipulating a fixed number of object instances. It can guide a robot to accomplish a diverse set of unseen tasks, manipulating diverse objects. Furthermore, the generated data enables training behavior cloning policies that outperform training with human demonstrations.
  • Figure 2: Manipulate-Anything framework. The process begins by inputting a scene representation and a natural language task instruction into a VLM, which identifies objects and determines sub-tasks. For each sub-task, we provide multi-view images, verification conditions, and task goals to the action generation module, producing a task-specific grasp pose or action code. This leads to a temporary goal state, assessed by the sub-task verification module for error recovery. Once all sub-tasks are achieved, we filter the trajectories to obtain successful demonstrations for downstream policy training. (A minimal pseudocode sketch of this generation loop appears after this figure list.)
  • Figure 3: Manipulate-Anything is an open-vocabulary autonomous robot demonstration generation system. We show zero-shot demonstrations for 14 tasks in simulation and 7 tasks in the real world.
  • Figure 4: Scaling experiment. Policy performance as the number of training demonstrations increases.
  • Figure 5: Action Distribution for Generated Data: We compare the action distribution of data generated by various methods against human-generated demonstrations via RLBench on the same set of tasks. We observe a high similarity between the distribution of our generated data and the human-generated data. This is further supported by the computed CD between our method and the RLBench data, which yields the lowest value (CD = 0.056).
  • ...and 1 more figure
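
The Figure 2 caption describes the core data-generation loop: plan sub-tasks with a VLM, propose an action for each sub-task, execute it, verify the resulting state, retry on failure, and keep only successful trajectories. Below is a minimal Python sketch of that loop. It is illustrative only: the function names (observe, plan_subtasks, propose_action, execute, verify) are hypothetical placeholders standing in for the paper's VLM planning, action-generation, and sub-task verification modules, not the authors' actual API.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List


@dataclass
class SubTask:
    description: str   # e.g. "grasp the red block"
    verification: str  # natural-language success condition checked by the verifier


@dataclass
class Trajectory:
    actions: List[Dict] = field(default_factory=list)
    success: bool = False


def generate_demo(task_instruction: str,
                  observe: Callable[[], Dict],
                  plan_subtasks: Callable[[Dict, str], List[SubTask]],
                  propose_action: Callable[[Dict, SubTask], Dict],
                  execute: Callable[[Dict], None],
                  verify: Callable[[Dict, SubTask], bool],
                  max_retries: int = 3) -> Trajectory:
    """Roll out one demonstration: plan sub-tasks, act, verify, retry on failure."""
    traj = Trajectory()
    scene = observe()  # multi-view observation of the current scene
    for sub_task in plan_subtasks(scene, task_instruction):
        for _ in range(max_retries):  # error recovery: re-propose on failure
            action = propose_action(observe(), sub_task)  # grasp pose or action code
            execute(action)
            traj.actions.append(action)
            if verify(observe(), sub_task):  # sub-task verification module
                break
        else:
            return traj  # sub-task never verified; this rollout gets filtered out
    traj.success = True  # all sub-tasks achieved: keep this demonstration
    return traj
```

In the system described by the caption, the planning and verification steps are VLM queries over multi-view images, the proposed action is either a task-specific grasp pose or generated action code, and failed rollouts are filtered out so that only successful demonstrations are used for downstream behavior cloning.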