
AdaptPNP: Integrating Prehensile and Non-Prehensile Skills for Adaptive Robotic Manipulation

Jinxuan Zhu, Chenrui Tie, Xinyi Cao, Yuran Wang, Jingxiang Guo, Zixuan Chen, Haonan Chen, Junting Chen, Yangyu Xiao, Ruihai Wu, Lin Shao

TL;DR

AdaptPNP tackles the challenge of unifying prehensile and non-prehensile robotic manipulation by integrating a vision-language model (VLM) planner with a physics-aware digital twin and a closed-loop reflection mechanism. The framework generates high-level plan skeletons from visual and textual inputs, grounds each primitive into precise 6D object sub-goals in a digital twin, and executes via low-level controllers, refining plans online based on execution feedback. Its key innovations are the 6D pose intermediate representation that bridges planning and execution, and the reflection loop that adaptively replans in the face of physical infeasibility or unexpected dynamics. Demonstrations in simulation and real-world tests show superior performance over baselines, with robust sim-to-real transfer and the ability to handle diverse tasks requiring hybrid prehensile and non-prehensile (P&NP) strategies, signaling a significant step toward versatile, human-level robotic manipulation.

Abstract

Non-prehensile (NP) manipulation, in which robots alter object states without forming stable grasps (for example, pushing, poking, or sliding), significantly broadens robotic manipulation capabilities when grasping is infeasible or insufficient. However, enabling a unified framework that generalizes across different tasks, objects, and environments while seamlessly integrating non-prehensile and prehensile (P) actions remains challenging: robots must determine when to invoke NP skills, select the appropriate primitive for each context, and compose P and NP strategies into robust, multi-step plans. We introduce AdaptPNP, a vision-language model (VLM)-empowered task and motion planning framework that systematically selects and combines P and NP skills to accomplish diverse manipulation objectives. Our approach leverages a VLM to interpret visual scene observations and textual task descriptions, generating a high-level plan skeleton that prescribes the sequence and coordination of P and NP actions. A digital-twin-based, object-centric intermediate layer predicts desired object poses, enabling proactive mental rehearsal of manipulation sequences. Finally, a control module synthesizes low-level robot commands, with continuous execution feedback enabling online task plan refinement and adaptive replanning through the VLM. We evaluate AdaptPNP across representative P&NP hybrid manipulation tasks in both simulation and real-world environments. These results underscore the potential of hybrid P&NP manipulation as a crucial step toward general-purpose, human-level robotic manipulation capabilities. Project Website: https://sites.google.com/view/adaptpnp/home

Paper Structure

This paper contains 25 sections, 4 figures, 4 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of AdaptPNP. A VLM-based task planner generates a mixed sequence of prehensile (grasp, moveto, release) and non-prehensile (push, rotate) primitives; a digital twin "mentally rehearses" each primitive by generating target 6D object poses; and a closed-loop reflection mechanism uses execution feedback to iteratively refine the plan.
  • Figure 2: Pipeline of AdaptPNP. Starting from an instruction and scene image, the task planner generates an initial plan (e.g., direct push), which is mentally rehearsed in the digital twin to sample a 6D target pose. After execution fails, the reflector analyzes the error and provides insight to the planner, which replans (e.g., grasp-and-move). This loop continues until the successful plan (push-to-edge-then-grasp) completes the task.
  • Figure 3: Task Setup. We evaluate AdaptPNP on a spectrum of P&NP hybrid manipulation scenarios, including eight simulated tasks (top two rows) and four real-world tasks (bottom row). In each scene, the final target pose is shown as a translucent object, and the target region is indicated by a yellow overlay (e.g., Pusher, Hook).
  • Figure 4: Real-World Task Process Visualization. The goal is to place the box at the translucent target pose. Direct grasp fails because the box is slightly wider than the gripper. AdaptPNP replans by first pushing the box to the table edge and then grasping it from the side, successfully reaching the goal.
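
The plan, execute, reflect loop described in the Figure 2 and Figure 4 captions can be illustrated with a small Python sketch. This is not the authors' implementation: `closed_loop`, `toy_planner`, `toy_executor`, and `Feedback` are hypothetical names chosen for illustration, the VLM planner is replaced by a fixed list of candidate plans, and the digital-twin rehearsal step is omitted entirely. The sketch shows only the control flow, in which a failure message is fed back to the planner as an insight before the next attempt.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple

# All names here are hypothetical stand-ins, not the AdaptPNP API.
Plan = List[str]  # a sequence of primitive names, e.g. ["push", "grasp"]


@dataclass
class Feedback:
    success: bool
    message: str = ""


def closed_loop(
    propose: Callable[[List[str]], Optional[Plan]],
    execute: Callable[[Plan], Feedback],
    max_rounds: int = 5,
) -> Tuple[Optional[Plan], List[str]]:
    """Propose a plan, execute it, and feed failures back as insights."""
    insights: List[str] = []
    for _ in range(max_rounds):
        plan = propose(insights)
        if plan is None:  # the planner has nothing left to try
            break
        feedback = execute(plan)
        if feedback.success:
            return plan, insights
        insights.append(feedback.message)  # reflection: remember why it failed
    return None, insights


def toy_planner(insights: List[str]) -> Optional[Plan]:
    # Stand-in for the VLM: each recorded failure advances to the next candidate.
    candidates = [
        ["grasp", "moveto", "release"],          # direct grasp
        ["push", "grasp", "moveto", "release"],  # push to edge, then grasp
    ]
    idx = len(insights)
    return candidates[idx] if idx < len(candidates) else None


def toy_executor(plan: Plan) -> Feedback:
    # Assumption mirroring Figure 4: the box is too wide for a direct grasp.
    if plan[0] == "grasp":
        return Feedback(False, "box wider than gripper; direct grasp infeasible")
    return Feedback(True)


plan, insights = closed_loop(toy_planner, toy_executor)
print(plan)
print(insights)
```

In this toy setting the first candidate fails, one insight is recorded, and the second candidate (push, then grasp) is returned. A real system would replace `toy_planner` with a VLM query conditioned on the scene and the insights, and `toy_executor` with rehearsal in the digital twin followed by execution on the robot.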