GraspXL: Generating Grasping Motions for Diverse Objects at Scale

Hui Zhang, Sammy Christen, Zicong Fan, Otmar Hilliges, Jie Song

TL;DR

GraspXL presents a scalable reinforcement-learning framework for generating grasping motions that satisfy multiple objectives across diverse objects and dexterous hands without relying on hand–object interaction data. It combines objective-driven hand guidance, a curriculum learning strategy, and distance-based object features to generalize to over 500k unseen objects, including generated or reconstructed meshes. The approach achieves high grasp success and close adherence to targets on PartNet, ShapeNet, and Objaverse, while transferring across MANO, Shadow, Allegro, and Faive hands. The authors release code, pretrained policies, and a large-scale dataset of generated grasp motions to enable downstream research and applications.

Abstract

Human hands possess the dexterity to interact with diverse objects, for example by grasping specific parts of an object and/or approaching it from a desired direction. More importantly, humans can grasp objects of any shape without object-specific skills. Recent works synthesize grasping motions that follow a single objective, such as a desired approach heading direction or a grasping area. Moreover, they usually rely on expensive 3D hand-object data during training and inference, which limits their ability to synthesize grasping motions for unseen objects at scale. In this paper, we unify the generation of hand-object grasping motions across multiple motion objectives, diverse object shapes, and dexterous hand morphologies in a policy learning framework, GraspXL. The objectives are the graspable area, the heading direction during approach, the wrist rotation, and the hand position. Without requiring any 3D hand-object interaction data, our policy, trained on 58 objects, can robustly synthesize diverse grasping motions for more than 500k unseen objects with a success rate of 82.2%. At the same time, the policy adheres to the objectives, which enables the generation of diverse grasps per object. Moreover, we show that our framework can be deployed to different dexterous hands and works with reconstructed or generated objects. We evaluate our method quantitatively and qualitatively to show its efficacy. Our model, code, and the large-scale generated motions are available at https://eth-ait.github.io/graspxl/.

Paper Structure (23 sections, 6 equations, 6 figures, 11 tables)

This paper contains 23 sections, 6 equations, 6 figures, 11 tables.

Figures (6)

  • Figure 1: Large-scale Grasping Synthesis. Our method, GraspXL, can generate grasps at scale for robotic hands and the MANO hand model. Here we show large-scale generated results, best viewed zoomed in.
  • Figure 2: Objective-driven Grasping Motion Synthesis. Given a hand model and an object, our goal is to synthesize grasp motions that adhere to one or more high-level objectives: graspable areas (indicated by the shadow), heading directions (indicated by the red arrow), wrist rotations (indicated by the black arrow), and positions of the hand (indicated by the green dot). In each sequence, darker hands are more recent in time.
  • Figure 3: Overview of GraspXL. As shown in the top row, our method can use captured, generated, or reconstructed objects, and different dexterous hand platforms such as MANO, Shadow, Allegro, or Faive. Given an object and a hand model, the policy takes objectives and states as inputs (left) and outputs dynamic grasp motions that follow those objectives (right, where darker hands are more recent in time). The objectives can be the heading direction, wrist rotation, hand position, or graspable area. The states contain the hand state; the contact, force, and distance of each link with respect to the object; and the object point cloud.
  • Figure 4: Task definition. (a) The local coordinate system of the hand where the x-axis is the heading direction $\mathbf{v}$, the origin is the midpoint position $\mathbf{m}$ (see text for definition), the rotation about $\mathbf{v}$ is $\omega$. (b) Given an object with user-specified graspable $\{\mathbf{o}^+_j\}$ and non-graspable vertices $\{\mathbf{o}^-_j\}$ (labelled in red and blue), the goal of the agent is to approach and grasp the object while satisfying motion objectives $\bar{\mathbf{v}}$, $\bar{\mathbf{m}}$, $\bar{\omega}$, and contact with the graspable area $\{\mathbf{o}^+_j\}$.
  • Figure 5: Generated Motions of Different Hands with the Same Objectives. We require the hands to approach from the right and grasp the upper part of the glass.
  • ...and 1 more figure
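
The task definition in the Figure 4 caption names four objectives: a target heading direction $\bar{\mathbf{v}}$, a target midpoint position $\bar{\mathbf{m}}$, a target wrist rotation $\bar{\omega}$ about the heading, and contact with the graspable vertices $\{\mathbf{o}^+_j\}$ rather than the non-graspable ones $\{\mathbf{o}^-_j\}$. The sketch below shows one plausible way to measure how far a hand pose is from each objective. It is illustrative only: the function names, the nearest-vertex contact rule, and the choice of error measures are assumptions made here, not the reward terms or evaluation metrics of GraspXL, and the midpoint $\mathbf{m}$ is taken as given because its definition is not reproduced in this excerpt.

```python
import math
from typing import Sequence, Tuple

Vec3 = Tuple[float, float, float]


def _sub(a: Vec3, b: Vec3) -> Vec3:
    return (a[0] - b[0], a[1] - b[1], a[2] - b[2])


def _dot(a: Vec3, b: Vec3) -> float:
    return a[0] * b[0] + a[1] * b[1] + a[2] * b[2]


def _norm(a: Vec3) -> float:
    return math.sqrt(_dot(a, a))


def heading_error(v: Vec3, v_bar: Vec3) -> float:
    """Angle in radians between the current heading v and the target v_bar."""
    cos_angle = _dot(v, v_bar) / (_norm(v) * _norm(v_bar))
    # Clamp to guard against rounding slightly outside [-1, 1].
    return math.acos(max(-1.0, min(1.0, cos_angle)))


def rotation_error(omega: float, omega_bar: float) -> float:
    """Smallest absolute difference between two angles, in [0, pi]."""
    d = (omega - omega_bar) % (2.0 * math.pi)
    return min(d, 2.0 * math.pi - d)


def position_error(m: Vec3, m_bar: Vec3) -> float:
    """Euclidean distance between the current midpoint m and the target m_bar."""
    return _norm(_sub(m, m_bar))


def graspable_contact_ratio(
    contacts: Sequence[Vec3],
    graspable: Sequence[Vec3],
    non_graspable: Sequence[Vec3],
) -> float:
    """Fraction of contact points whose nearest labelled vertex is graspable.

    Hypothetical rule: a contact counts as "on the graspable area" when it is
    strictly closer to some graspable vertex than to every non-graspable one.
    """
    if not contacts:
        return 0.0
    hits = 0
    for c in contacts:
        d_pos = min((_norm(_sub(c, o)) for o in graspable), default=math.inf)
        d_neg = min((_norm(_sub(c, o)) for o in non_graspable), default=math.inf)
        if d_pos < d_neg:
            hits += 1
    return hits / len(contacts)
```

Keeping the four measures separate mirrors the caption's point that a grasp may be constrained by one objective or by several at once; a caller can evaluate only the objectives that were actually specified.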