Dexterous Functional Grasping

Ananye Agarwal, Shagun Uppal, Kenneth Shaw, Deepak Pathak

TL;DR

Dexterous Functional Grasping addresses real-world, tool-oriented manipulation with a dexterous hand by integrating semantic affordances learned from internet data with a sim-trained low-level controller constrained to an eigengrasp action space. The method localizes a functional region via a one-shot DINOv2/DETIC-based affordance model, then executes a blind, proprioception-driven grasp policy trained in simulation, followed by post-grasp trajectories for tool use. Empirical results show strong performance in both simulation and real-world trials across multiple objects, outperforming hardcoded baselines and, in several cases, beating a trained teleoperator. The work demonstrates that combining semantic priors with a reduced, physically plausible action space yields robust, transferable dexterous grasping suitable for in-the-wild tool use.

Abstract

While there have been significant strides in dexterous manipulation, most of this progress is limited to benchmark tasks like in-hand reorientation, which are of limited utility in the real world. The main benefit of dexterous hands over two-fingered ones is their ability to pick up tools and other objects (including thin ones) and grasp them firmly to apply force. However, this task requires both a complex understanding of functional affordances and precise low-level control. While prior work obtains affordances from human data, this approach doesn't scale to low-level control. Similarly, simulation training cannot give the robot an understanding of real-world semantics. In this paper, we aim to combine the best of both worlds to accomplish functional grasping for in-the-wild objects. We use a modular approach. First, affordances are obtained by matching corresponding regions of different objects, and then a low-level policy trained in simulation is run to grasp the object. We propose a novel application of eigengrasps to reduce the search space of RL using a small amount of human data and find that it leads to more stable and physically realistic motion. We find that the eigengrasp action space beats baselines in simulation, outperforms hardcoded grasping in the real world, and matches or outperforms a trained human teleoperator. Result visualizations and videos are at https://dexfunc.github.io/
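
The eigengrasp idea in the abstract can be illustrated with a minimal sketch: run PCA on a small set of hand poses, keep the leading components, and let the policy output coefficients in that low-dimensional space instead of raw joint targets. Everything specific here is an assumption made for illustration and is not taken from the paper's code: the function names, the 16-joint hand, the joint limits, and the random array standing in for the human pose dataset. The choice of 9 components follows the figure caption on this page.

```python
import numpy as np


def fit_eigengrasps(hand_poses, num_eigengrasps=9):
    """Fit a PCA basis to joint-angle vectors.

    hand_poses: array of shape (num_samples, num_joints).
    Returns (mean_pose, basis) where basis has shape (num_eigengrasps, num_joints).
    """
    mean_pose = hand_poses.mean(axis=0)
    centered = hand_poses - mean_pose
    # Rows of vt are orthonormal principal directions, ordered by decreasing variance.
    _, _, vt = np.linalg.svd(centered, full_matrices=False)
    return mean_pose, vt[:num_eigengrasps]


def action_to_joint_targets(action, mean_pose, basis, joint_low, joint_high):
    """Map a low-dimensional action (eigengrasp coefficients) to joint targets."""
    joints = mean_pose + action @ basis
    return np.clip(joints, joint_low, joint_high)


# Synthetic stand-in for a small dataset of recorded hand poses (assumption).
NUM_JOINTS = 16  # hypothetical joint count, chosen for illustration
rng = np.random.default_rng(0)
hand_poses = rng.uniform(-0.5, 1.5, size=(200, NUM_JOINTS))
joint_low = np.full(NUM_JOINTS, -0.5)   # hypothetical joint limits
joint_high = np.full(NUM_JOINTS, 1.5)

mean_pose, basis = fit_eigengrasps(hand_poses, num_eigengrasps=9)

# A policy would emit a 9-dimensional action; here it is sampled at random.
action = rng.normal(scale=0.3, size=9)
targets = action_to_joint_targets(action, mean_pose, basis, joint_low, joint_high)
print(targets.shape)
```

In this sketch the RL search space shrinks from one dimension per joint to one per eigengrasp, and every commanded pose is a linear combination of directions observed in the pose data, which is one plausible reading of why such an action space yields more physically realistic motion.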

Paper Structure

This paper contains 19 sections, 1 equation, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: We accomplish functional grasping in the wild with a dexterous hand, using a single policy to pick up and functionally grasp objects like hammers, drills, saucepans, staplers and screwdrivers in different positions and orientations. We combine the strengths of both internet data and large-scale simulation. An affordance model based on matching DINOv2 features is used to localize the object and move close to the functional region of the object. A blind reactive policy then picks up the object and moves it inside the palm to a firm grasp so that post-grasp motions like drilling, hammering, etc. can be executed. Even though the policy only sees hammers at training time (bottom), it generalizes to a much wider set at deployment. Videos at https://dexfunc.github.io/.
  • Figure 2: We divide the problem into three phases: pre-grasp, grasping and post-grasp. This combines large-scale data from both the internet and simulation. Internet data helps to generalize to a large set of visually diverse objects and tells the robot 'where' to grasp. Simulation data allows training adaptive policies that work with objects of different physical properties and are even robust to errors in the pre-grasp. (1) To get the pre-grasp pose, we use a one-shot affordance model. After annotating one object, we are able to get affordances for other objects in that category via feature matching. Given a new object, the arm is moved to that point and oriented perpendicular to the principal component of the object mask. (2) Next, a policy trained in simulation is executed. We use a novel eigengrasp action space reduction to make training feasible. A small dataset of hand poses is collected and 9 eigengrasps are extracted from it. The policy is trained in the linear space of these grasps.
  • Figure 3: Hardware setup with the LEAP hand mounted on an xarm6, with one D435 camera along each axis.
  • Figure 4: Affordance prediction for an upright drill from multiple angles. The best angle of approach is from the side, and that is also the angle with the highest affordance score. Our system picks this angle and then grasps.
  • Figure 5: The initial pre-grasp is wrong and the thumb gets stuck between palm and bottle, but the policy recovers and moves the thumb around to the correct grasp.
  • ...and 6 more figures
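
The pre-grasp step described in the Figure 2 caption (transfer an annotated affordance point by feature matching, then orient perpendicular to the principal component of the object mask) can be sketched roughly as follows. This is a hedged illustration, not the authors' pipeline: the function names are hypothetical, random arrays stand in for DINOv2 patch features, matching is reduced to a single nearest-neighbor lookup by cosine similarity, and the orientation is computed in the 2D image plane only.

```python
import numpy as np


def match_affordance_point(ref_feature, target_features):
    """Find the target patch most similar to an annotated reference feature.

    ref_feature: (d,) feature at the annotated point on the reference object.
    target_features: (h, w, d) patch features of the new object.
    Returns ((row, col), cosine_similarity).
    """
    h, w, d = target_features.shape
    flat = target_features.reshape(-1, d)
    flat = flat / np.linalg.norm(flat, axis=1, keepdims=True)
    ref = ref_feature / np.linalg.norm(ref_feature)
    scores = flat @ ref
    best = int(np.argmax(scores))
    return divmod(best, w), float(scores[best])


def mask_axes(mask):
    """Return (principal, perpendicular) unit vectors of a binary mask in (x, y)."""
    ys, xs = np.nonzero(mask)
    pts = np.stack([xs, ys], axis=1).astype(float)
    pts -= pts.mean(axis=0)
    cov = pts.T @ pts / len(pts)
    # eigh returns eigenvalues in ascending order, so the last column is the principal axis.
    _, eigvecs = np.linalg.eigh(cov)
    principal = eigvecs[:, -1]
    perpendicular = np.array([-principal[1], principal[0]])
    return principal, perpendicular


rng = np.random.default_rng(0)

# Stand-in patch features for a new object (assumption: 8x8 grid, 32-dim features).
target_features = rng.normal(size=(8, 8, 32))
# Pretend the annotated reference feature resembles the patch at row 3, column 5.
ref_feature = target_features[3, 5] + 0.01 * rng.normal(size=32)
point, score = match_affordance_point(ref_feature, target_features)

# Stand-in object mask: a horizontal bar, e.g. a tool handle lying flat.
mask = np.zeros((40, 60), dtype=bool)
mask[18:22, 5:55] = True
principal, perpendicular = mask_axes(mask)
print(point, principal, perpendicular)
```

For the horizontal bar, the principal axis lies along the image x direction, so the perpendicular vector points along y, which corresponds to approaching the handle across its width.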