
TiPToP: A Modular Open-Vocabulary Planning System for Robotic Manipulation

William Shen, Nishanth Kumar, Sahit Chintalapudi, Jie Wang, Christopher Watson, Edward Hu, Jing Cao, Dinesh Jayaraman, Leslie Pack Kaelbling, Tomás Lozano-Pérez

TL;DR

This work evaluates TiPToP -- which requires zero robot data -- over 28 tabletop manipulation tasks in simulation and the real world and finds it matches or outperforms $\pi_{0.5}\text{-DROID}$, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations.

Abstract

We present TiPToP, an extensible modular system that combines pretrained vision foundation models with an existing Task and Motion Planner (TAMP) to solve multi-step manipulation tasks directly from input RGB images and natural-language instructions. Our system aims to be simple and easy to use: it can be installed and run on a standard DROID setup in under one hour and adapted to new embodiments with minimal effort. We evaluate TiPToP -- which requires zero robot data -- over 28 tabletop manipulation tasks in simulation and the real world and find it matches or outperforms $\pi_{0.5}\text{-DROID}$, a vision-language-action (VLA) model fine-tuned on 350 hours of embodiment-specific demonstrations. TiPToP's modular architecture enables us to analyze the system's failure modes at the component level. We analyze results from an evaluation of 173 trials and identify directions for improvement. We release TiPToP open-source to further research on modular manipulation systems and tighter integration between learning and planning. Project website and code: https://tiptop-robot.github.io

Paper Structure (24 sections, 3 equations, 5 figures, 2 tables)


Figures (5)

  • Figure 1: TiPToP System Overview. TiPToP takes a stereo RGB image pair and a natural language instruction $\mathcal{L}$ as input and outputs robot joint trajectories with gripper commands. (a) The perception module constructs an object-centric 3D scene representation using learned depth estimation, grasp prediction, object detection, and segmentation. (b) The planning module uses GPU-parallelized TAMP (cuTAMP) to find feasible manipulation plans. (c) The execution module tracks the planned trajectory using a joint impedance controller.
  • Figure 2: Perception Module. (a) Depth map predicted by FoundationStereo with sharp object boundaries. (b) Grasps predicted by M2T2 on the scene point cloud (colors correspond to grasp confidences). (c) Labeled object bounding boxes and symbolic goal $\mathcal{G}$ predicted by Gemini ($\text{On}(a, b)$ specifies that object $a$ should be placed on object or surface $b$).
  • Figure 3: Wiping. We demonstrate that TiPToP can be straightforwardly extended to perform wiping in addition to pick-and-place. Task instruction: "erase the whiteboard and put everything into the bowl".
  • Figure 4: Failure Analysis. Sankey diagram showing outcomes of 173 trials. The most common failure modes are grasping failures (missed or unstable grasps), followed by scene completion errors, VLM detection errors, then cuTAMP failures.
  • Figure 5: Object Segmentation. SAM-2 generates eight pixel-level segmentation masks from the bounding boxes in Fig. 2c.
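
The symbolic goal $\mathcal{G}$ in the Figure 2 caption is built from $\text{On}(a, b)$ literals. As a rough illustration only, the sketch below models such a goal as a set of literals and checks it against a simple map of what each object rests on. The class `On`, the function `goal_satisfied`, and the object names are hypothetical and are not taken from the TiPToP or cuTAMP code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class On:
    """On(a, b): object `a` should be placed on object or surface `b`."""
    obj: str
    support: str


def goal_satisfied(goal, supports):
    """Return True if every On(a, b) literal in `goal` holds.

    `supports` maps each object name to whatever it currently rests on.
    This is an illustrative stand-in for a real scene state.
    """
    return all(supports.get(lit.obj) == lit.support for lit in goal)


# Hypothetical goal for an instruction such as "put everything into the bowl".
goal = frozenset({On("marker", "bowl"), On("eraser", "bowl")})

# Hypothetical scene states before and after execution.
before = {"marker": "table", "eraser": "table", "bowl": "table"}
after = {"marker": "bowl", "eraser": "bowl", "bowl": "table"}
```

Here `goal_satisfied(goal, before)` is false and `goal_satisfied(goal, after)` is true. Treating the goal as a conjunction of literals keeps the check independent of how a planner reaches the final state.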