Local Policies Enable Zero-shot Long-horizon Manipulation

Murtaza Dalal, Min Liu, Walter Talbott, Chen Chen, Deepak Pathak, Jian Zhang, Ruslan Salakhutdinov

TL;DR

ManipGen tackles zero-shot long-horizon manipulation by learning a library of local, pose-invariant manipulation skills in simulation and transferring them to the real world. It combines vision-language planning with fast motion planning and local policies learned through large-scale RL and distilled via DAgger, enabling interaction with unseen objects and configurations. In experiments spanning Robosuite simulations and five real-world environments, ManipGen achieves state-of-the-art or near state-of-the-art performance (e.g., 97% on Robosuite and 76% zero-shot real-world success across 50 tasks), outperforming SayCan, OpenVLA, LLMTrajGen, VoxPoser, and Transic baselines. The approach demonstrates strong generalization, robustness to perception perturbations, and practical potential for broad real-world manipulation tasks.

Abstract

Sim2real for robotic manipulation is difficult due to the challenges of simulating complex contacts and generating realistic task distributions. To tackle the latter problem, we introduce ManipGen, which leverages a new class of policies for sim2real transfer: local policies. Locality enables a variety of appealing properties, including invariances to absolute robot and object pose, skill ordering, and global scene configuration. We combine these policies with foundation models for vision, language and motion planning and demonstrate SOTA zero-shot performance of our method on Robosuite benchmark tasks in simulation (97%). We transfer our local policies from simulation to reality and observe that they can solve unseen long-horizon manipulation tasks with up to 8 stages and significant pose, object and scene configuration variation. ManipGen outperforms SOTA approaches such as SayCan, OpenVLA, LLMTrajGen and VoxPoser across 50 real-world manipulation tasks by 36%, 76%, 62% and 60% respectively. Video results at https://mihdalal.github.io/manipgen/

Paper Structure

This paper contains 28 sections, 12 equations, 6 figures, 9 tables.

Figures (6)

  • Figure 1: Zero-shot Long-horizon Manipulation. Our approach trains a library of generalist manipulation skills in simulation and transfers them zero-shot to long-horizon manipulation tasks. We show a single, text-conditioned agent can manipulate unseen objects, in arbitrary poses and scene configurations, across long horizons in the real world, solving challenging manipulation tasks with complex obstacles.
  • Figure 2: Training Environments. We train local policies (left to right) on picking, placing, handle grasping, opening and closing.
  • Figure 3: ManipGen Method Overview. (Left) Train 1000s of RL experts in simulation using PPO. (Middle) Distill single-task RL experts into generalist visuomotor policies via DAgger. (Right) Text-conditioned long-horizon manipulation via task decomposition (VLM), pose estimation and goal reaching (motion planning), and sim2real transfer of local policies.
  • Figure 4: Depth Augmentation. Visualization of edge artifacts and random holes on depth maps.
  • Figure 5: Example scene layouts for real world evaluation.
  • ...and 1 more figure