Concurrent Prehensile and Nonprehensile Manipulation: A Practical Approach to Multi-Stage Dexterous Tasks

Hao Jiang, Yue Wu, Yue Wang, Gaurav S. Sukhatme, Daniel Seita

Abstract

Dexterous hands enable concurrent prehensile and nonprehensile manipulation, such as holding one object while interacting with another, a capability essential for everyday tasks yet underexplored in robotics. Learning such long-horizon, contact-rich multi-stage behaviors is challenging because demonstrations are expensive to collect and end-to-end policies require substantial data to generalize across varied object geometries and placements. We present DexMulti, a sample-efficient approach for real-world dexterous multi-task manipulation that decomposes demonstrations into object-centric skills with well-defined temporal boundaries. Rather than learning monolithic policies, our method retrieves demonstrated skills based on current object geometry, aligns them to the observed object state using an uncertainty-aware estimator that tracks centroid and yaw, and executes them via a retrieve-align-execute paradigm. We evaluate on three multi-stage tasks requiring concurrent manipulation (Grasp + Pull, Grasp + Open, and Grasp + Grasp) across two dexterous hands (Allegro and LEAP) in over 1,000 real-world trials. Our approach achieves an average success rate of 66% on training objects with only 3-4 demonstrations per object, outperforming diffusion policy baselines by 2-3× while requiring far fewer demonstrations. Results demonstrate robust generalization to held-out objects and spatial variations up to ±25 cm.
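
To make the alignment step concrete, here is a minimal sketch of re-posing a demonstrated skill from the object pose seen at demonstration time to the pose observed at test time, using only a centroid and a yaw angle. The function names (`yaw_rotation`, `align_waypoints`), the waypoint layout, and the restriction to a planar rotation are illustrative assumptions, not the authors' implementation; the uncertainty-aware part of the estimator is omitted.

```python
import numpy as np


def yaw_rotation(yaw: float) -> np.ndarray:
    """Return the 3x3 rotation matrix for a rotation of `yaw` radians about z."""
    c, s = np.cos(yaw), np.sin(yaw)
    return np.array([[c, -s, 0.0],
                     [s,  c, 0.0],
                     [0.0, 0.0, 1.0]])


def align_waypoints(waypoints, demo_centroid, demo_yaw, obs_centroid, obs_yaw):
    """Re-pose demonstrated waypoints (N, 3) to match the observed object state.

    Hypothetical helper: assumes waypoints are world-frame positions recorded
    while the object sat at (demo_centroid, demo_yaw).
    """
    waypoints = np.asarray(waypoints, dtype=float)
    demo_centroid = np.asarray(demo_centroid, dtype=float)
    obs_centroid = np.asarray(obs_centroid, dtype=float)

    # Step 1: express the waypoints in the object's canonical frame.
    # For row vectors, multiplying by R applies R^T to each point.
    canonical = (waypoints - demo_centroid) @ yaw_rotation(demo_yaw)

    # Step 2: map the canonical waypoints into the observed object frame.
    return canonical @ yaw_rotation(obs_yaw).T + obs_centroid


# A single waypoint 10 cm in front of the object along x at demonstration time.
demo_waypoints = np.array([[0.1, 0.0, 0.0]])
aligned = align_waypoints(
    demo_waypoints,
    demo_centroid=[0.0, 0.0, 0.0], demo_yaw=0.0,
    obs_centroid=[1.0, 2.0, 0.0], obs_yaw=np.pi / 2,
)
```

With the object moved to (1, 2, 0) and rotated by 90 degrees, the waypoint follows the object and lands 10 cm along the world y-axis from the new centroid. A full system would also transform end-effector orientations, which this sketch leaves out.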

Paper Structure (53 sections, 4 equations, 6 figures, 3 tables, 1 algorithm)

Figures (6)

  • Figure 1: Our method enables real-world dexterous multi-stage manipulation requiring concurrent prehensile and nonprehensile interaction. We demonstrate three challenging tasks: (Top) Grasp+Pull: grasping an object and pulling open a drawer while maintaining the grasp; (Middle) Grasp+Open: grasping an object and opening a container lid; (Bottom) Grasp+Grasp: sequentially grasping two objects without releasing the first. Each row shows key frames from a successful rollout, highlighting the robot's ability to stably hold one object while manipulating another. More videos and code are available on our anonymous website https://dexmulti.github.io/.
  • Figure 2: Overview of the proposed pipeline. (1) Perception: Multi-view RGB-D with language-conditioned segmentation produces object-centric point clouds and contact maps. (2) Offline: Demonstrations are segmented into object-centric skills using interaction signals and stored in canonical form. (3) Online: An uncertainty-aware estimator tracks object centroid and yaw. Skills are retrieved via point-cloud matching, aligned under pose uncertainty, and executed via retrieve-align-execute. Panels (A–D) correspond to Sec. IV-A through IV-D.
  • Figure 3: Experimental setup. We use an xArm 7 robot arm with a dexterous hand (16-DoF LEAP or Allegro Hand) and two Intel RealSense L515 RGB-D cameras for multi-view perception.
  • Figure 4: Objects used in our experiments. Left: training objects used for demonstration collection. Right: held-out test objects, unseen during training and used exclusively for evaluation. The set spans a diverse range of geometries, sizes, and interaction affordances.
  • Figure 5: Qualitative robustness and failure analysis. Top row: our method remains effective under external perturbations. Bottom-left: failure in the first task stage during banana grasping, where stable force balance is difficult. Bottom-right: failure in the pulling stage, where the hand does not contact the handle at the correct location.
  • ...and 1 more figures
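
The retrieval step named in the Figure 2 caption ("Skills are retrieved via point-cloud matching") can be sketched as a nearest-neighbor lookup over stored object point clouds. The caption does not state which matching metric is used, so the symmetric Chamfer distance below is a swapped-in assumption, and `center`, `chamfer_distance`, `retrieve_skill`, and the dictionary-based skill library are hypothetical names chosen for illustration.

```python
import numpy as np


def center(points) -> np.ndarray:
    """Subtract the centroid so matching ignores where the object sits."""
    points = np.asarray(points, dtype=float)
    return points - points.mean(axis=0)


def chamfer_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Symmetric Chamfer distance between point sets a (N, 3) and b (M, 3)."""
    # Pairwise squared Euclidean distances, shape (N, M).
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(axis=-1)
    # Mean nearest-neighbor distance in both directions.
    return float(d2.min(axis=1).mean() + d2.min(axis=0).mean())


def retrieve_skill(observed_cloud, skill_library: dict) -> str:
    """Return the key of the stored cloud that is closest to the observation."""
    query = center(observed_cloud)
    scores = {
        name: chamfer_distance(query, center(cloud))
        for name, cloud in skill_library.items()
    }
    return min(scores, key=scores.get)


# Toy library: a compact box-like cloud and an elongated stick-like cloud.
rng = np.random.default_rng(0)
library = {
    "box": rng.uniform(-0.05, 0.05, size=(200, 3)),
    "stick": rng.uniform(-1.0, 1.0, size=(200, 3)) * np.array([0.2, 0.01, 0.01]),
}

# The observed object is the box shifted to a new table position.
observed = library["box"] + np.array([0.3, -0.1, 0.0])
best = retrieve_skill(observed, library)
```

Centering removes translation only; this sketch does not compensate for rotation, which in the described pipeline is handled by the yaw estimate during alignment.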