IFG: Internet-Scale Guidance for Functional Grasping Generation

Ray Muxin Liu, Mingxuan Li, Kenneth Shaw, Deepak Pathak

TL;DR

IFG addresses the gap between semantic perception and precise 3D grasping by marrying vision-language grounded region understanding with geometry-driven force-closure synthesis. It generates a large semantically informed grasp dataset via simulation and distills it into a diffusion model that can synthesize executable grasps directly from depth input. Empirically, IFG improves single-object and crowded-scene grasp robustness and naturalness compared with baselines, while enabling real-time deployment without hand-collected data. By leveraging Basis Point Set conditioning and diffusion-based synthesis, this work demonstrates a scalable path toward semantic-conditioned, geometry-aware dexterous grasping in open-world environments.

Abstract

Large vision models trained on internet-scale data have demonstrated strong capabilities in segmenting and semantically understanding object parts, even in cluttered, crowded scenes. However, while these models can direct a robot toward the general region of an object, they lack the geometric understanding required to precisely control dexterous robotic hands for 3D grasping. To overcome this, our key insight is to leverage simulation with a force-closure grasp generation pipeline that understands the local geometries of the hand and object in the scene. Because this pipeline is slow and requires ground-truth observations, the resulting data is distilled into a diffusion model that operates in real time on camera point clouds. By combining the global semantic understanding of internet-scale models with the geometric precision of simulation-based, locally aware force-closure optimization, IFG achieves high-performance semantic grasping without any manually collected training data. For visualizations, please visit our website at https://ifgrasping.github.io/
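To make the force-closure idea in the abstract concrete, the sketch below computes a relaxed force-closure residual: the norm of the net wrench (force and torque) produced when every contact pushes with a unit force along its inward surface normal. This is a common surrogate in optimization-based grasp synthesis, not necessarily the objective IFG uses. The function names, the unit-force assumption, and the frictionless contact model are all illustrative assumptions.

```python
from math import sqrt


def cross(a, b):
    """Cross product of two 3D vectors given as tuples."""
    return (
        a[1] * b[2] - a[2] * b[1],
        a[2] * b[0] - a[0] * b[2],
        a[0] * b[1] - a[1] * b[0],
    )


def force_closure_residual(contacts, normals):
    """Norm of the net wrench from unit forces along inward contact normals.

    contacts: list of 3D contact positions in the object frame (assumed).
    normals:  list of unit inward normals, one per contact (assumed).
    A residual near zero means the contact forces can balance each other;
    it is a relaxed surrogate, not a proof of force closure.
    """
    net_force = [0.0, 0.0, 0.0]
    net_torque = [0.0, 0.0, 0.0]
    for p, n in zip(contacts, normals):
        torque = cross(p, n)  # torque of the contact force about the origin
        for k in range(3):
            net_force[k] += n[k]
            net_torque[k] += torque[k]
    return sqrt(sum(v * v for v in net_force + net_torque))


# Two fingers pinching from opposite sides: forces and torques cancel.
antipodal = force_closure_residual(
    [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
    [(-1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
)

# Both contacts pushing the same way: the object would be pushed away.
one_sided = force_closure_residual(
    [(1.0, 0.0, 0.0), (-1.0, 0.0, 0.0)],
    [(1.0, 0.0, 0.0), (1.0, 0.0, 0.0)],
)
```

In an optimization-based pipeline, a residual of this kind would typically be minimized over hand pose and joint angles, together with penetration and contact-distance terms that are omitted here.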

Paper Structure

This paper contains 19 sections, 1 equation, 7 figures, 4 tables, 1 algorithm.

Figures (7)

  • Figure 1: IFG enables the generation of dexterous, functional grasps in cluttered, realistic scenes. It first uses a vision-language model to identify task-relevant regions on objects, then uses geometrically precise force closure in simulation to ground the finger joints. The resulting dataset, and the diffusion model trained on it, encode both semantic and geometric understanding of the scene without any hand-collected data.
  • Figure 2: IFG takes an object mesh and a task prompt as input. To incorporate semantic understanding, it renders the object from multiple viewpoints, applies a VLM-based segmentation model combining SAM (Kirillov et al., 2023) and VLPart (Sun et al., 2023), and reprojects the results into 3D space to identify task-relevant regions. For geometric grounding, it initializes a force closure objective at these regions and optimizes for functional grasps. The resulting data is then used to train a diffusion model for fast grasp synthesis from depth.
  • Figure 3: Compared to Get a Grip's synthetic grasp generation method, our method produces more human-like grasps. For instance, Get a Grip often grasps the bottom of the bottle, while our method robustly grasps the neck. Please see our website at https://ifgrasping.github.io/ for 3D visualizations.
  • Figure 4: To enable real-world deployment, the generated grasp data is distilled into a diffusion model. This model is conditioned on a Basis Point Set (BPS) computed from depth camera data, along with a noisy grasp input. Through the denoising process, the model produces refined grasps on the object. The architecture of the diffusion model follows a design similar to DexDiffuser (Weng et al., 2024).
  • Figure 5: Single-object evaluation in the Lift and Pick and Shake Task. Our method outperforms the Get a Grip baseline generation process on the top three segmentation prompts, owing to the guidance that the prompt and the VLM provide during grasp generation.
  • ...and 2 more figures
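
The Basis Point Set (BPS) conditioning mentioned in the Figure 4 caption can be illustrated with a minimal sketch. A BPS encoding fixes a set of basis points once, then describes any point cloud by the distance from each basis point to its nearest cloud point, giving a fixed-length vector regardless of how many points the depth camera returns. The function names, the unit-ball sampling, and the basis size below are assumptions for illustration, not details taken from the paper.

```python
import random
from math import dist


def sample_basis_points(n, radius=1.0, seed=0):
    """Sample n fixed basis points uniformly inside a ball (rejection sampling)."""
    rng = random.Random(seed)
    points = []
    while len(points) < n:
        p = tuple(rng.uniform(-radius, radius) for _ in range(3))
        if dist(p, (0.0, 0.0, 0.0)) <= radius:
            points.append(p)
    return points


def bps_encode(cloud, basis):
    """Distance from each basis point to its nearest neighbor in the cloud.

    The output length equals len(basis) and does not depend on the number
    or ordering of the points in the cloud.
    """
    return [min(dist(b, p) for p in cloud) for b in basis]


# Tiny hand-checkable case: a one-point cloud at the origin.
simple = bps_encode([(0.0, 0.0, 0.0)], [(1.0, 0.0, 0.0), (0.0, 2.0, 0.0)])

# A toy "depth" cloud, encoded against a reusable random basis.
basis = sample_basis_points(64)
cloud = [(0.1 * i, 0.0, 0.05 * i) for i in range(10)]
features = bps_encode(cloud, basis)
features_reordered = bps_encode(list(reversed(cloud)), basis)
```

Because the basis is fixed across all scenes, the resulting vector can be fed to a network with a plain fully connected input layer. In practice the cloud would be normalized into the basis ball first, and a nearest-neighbor structure such as a k-d tree would replace the brute-force search shown here.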