PIVOT-R: Primitive-Driven Waypoint-Aware World Model for Robotic Manipulation

Kaidong Zhang, Pengzhen Ren, Bingqian Lin, Junfan Lin, Shikui Ma, Hang Xu, Xiaodan Liang

TL;DR

PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Compared to synchronously executed PIVOT-R, PIVOT-R with AHE is 28-fold more efficient in execution, with only a 2.9% drop in performance.

Abstract

Language-guided robotic manipulation is a challenging task that requires an embodied agent to follow abstract user instructions to accomplish various complex manipulation tasks. Previous work trivially fits the data without revealing the relation between instructions and low-level executable actions. As a result, these models are prone to memorizing superficial patterns in the data instead of acquiring transferable knowledge, and are thus fragile to dynamic environment changes. To address this issue, we propose a PrImitive-driVen waypOinT-aware world model for Robotic manipulation (PIVOT-R) that focuses solely on the prediction of task-relevant waypoints. Specifically, PIVOT-R consists of a Waypoint-aware World Model (WAWM) and a lightweight action prediction module. The former performs primitive action parsing and primitive-driven waypoint prediction, while the latter focuses on decoding low-level actions. We also design an asynchronous hierarchical executor (AHE), which can use different execution frequencies for different modules of the model, thereby reducing computational redundancy and improving execution efficiency. PIVOT-R outperforms state-of-the-art (SoTA) open-source models on the SeaWave benchmark, achieving an average relative improvement of 19.45% across four levels of instruction tasks. Moreover, compared to the synchronously executed PIVOT-R, the execution efficiency of PIVOT-R with AHE is increased by 28-fold, with only a 2.9% drop in performance. These results provide compelling evidence that PIVOT-R can significantly improve both the performance and efficiency of robotic manipulation.
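The idea of running different modules at different execution frequencies can be illustrated with a minimal sketch. This is not the authors' implementation: the module names, the periods (20, 5, 1), and the string-producing stub functions are all hypothetical, and the loop is a single-threaded simulation of multi-rate scheduling rather than a truly asynchronous executor. It only shows how slower modules can cache their output while a fast module reuses it at every timestep.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List


@dataclass
class Module:
    """A stage that is re-run only every `period` timesteps (hypothetical)."""
    name: str
    period: int
    fn: Callable[[int, Dict[str, str]], str]
    calls: int = 0


def run_executor(modules: List[Module], steps: int) -> Dict[str, str]:
    """Step all modules; a module refreshes its cached output only when due."""
    cache: Dict[str, str] = {}
    for t in range(steps):
        # Modules are ordered from slowest (top of hierarchy) to fastest.
        for m in modules:
            if t % m.period == 0:
                cache[m.name] = m.fn(t, cache)
                m.calls += 1
    return cache


def build_modules() -> List[Module]:
    # Periods are illustrative assumptions, not values from the paper.
    return [
        Module("primitive", 20, lambda t, c: f"primitive@{t}"),
        Module("waypoint", 5, lambda t, c: f"waypoint@{t}|{c['primitive']}"),
        Module("action", 1, lambda t, c: f"action@{t}|{c['waypoint']}"),
    ]


if __name__ == "__main__":
    mods = build_modules()
    final = run_executor(mods, steps=40)
    for m in mods:
        print(m.name, m.calls)
    print(final["action"])
```

Over 40 timesteps in this toy setup, the slowest stub runs twice, the middle one 8 times, and the fastest 40 times, whereas a synchronous executor would run all three at every step. How much compute this saves in practice depends on the relative cost of each module.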

Paper Structure

This paper contains 33 sections, 2 equations, 11 figures, 10 tables.

Figures (11)

  • Figure 1: Comparison of PIVOT-R and other models. (a) Sequentially executed robot manipulation models. These models execute every module at each timestep to perform manipulation reasoning (e.g., RT-2 zitkovich2023rt2, RT-X vuong2023openrt-x, RT-H belkhale2024rt-h, VILA hu2023look, Octo octo_2023, etc.) or world modeling (e.g., Surfer Ren2023SurferPR, Daydreamer wu2023daydreamer, 3D-VLA zhen20243d, etc.). This easily leads to model redundancy and weak prediction of key manipulation nodes. (b) PIVOT-R is a primitive-driven waypoint-aware world model with asynchronous hierarchical executors. It focuses only on predicting waypoints related to the manipulation task, which makes key nodes in the task easier to predict than with other methods. In addition, PIVOT-R sets different execution frequencies for different modules, giving higher execution efficiency and lower redundancy.
  • Figure 2: PIVOT-R overview. It mainly consists of a waypoint-aware world model (WAWM) and an action prediction module, which cooperate through an asynchronous hierarchical executor (AHE). In WAWM, we first use a pre-trained VLM to perform low-frequency primitive action parsing on user instructions and provide waypoint indications for the scene prediction module. Then, the scene prediction module learns to model world knowledge based on waypoints and manipulation trajectories. Finally, we use a lightweight action prediction module to perform high-frequency action prediction and execution.
  • Figure 2: Performance of different methods on three real robot manipulation tasks (%). “Pick up”: pick up the correct object from the table. “Put on”: pick up the object and place it on the correct color block. “Push to”: push the object to the correct color block.
  • Figure 3: Examples show the execution process of PIVOT-R. The text below the image describes the primitive actions to be performed next. Blue arrows indicate the direction of actions.
  • Figure 4: Demonstrations from the real-world evaluation. The first row is "pick up the coke", the second row is "put the red bottle on the yellow block", and the third row is "push the object on the desk to the pink block".
  • ...and 6 more figures