GRIM: Task-Oriented Grasping with Conditioning on Generative Examples

Shailesh, Alok Raj, Nayan Kumar, Priya Shukla, Andrew Melnik, Michael Beetz, Gora Chand Nandi

TL;DR

GRIM tackles data scarcity in Task-Oriented Grasping by adopting a training-free, memory-driven approach that retrieves functional priors from heterogeneous sources and aligns them in 3D through semantic cues. The framework leverages a retrieve-align-transfer pipeline: memory creation from AI-generated videos, web images, and expert demonstrations; memory retrieval via joint DINO- and CLIP-based similarity; semantic 3D alignment to transfer full 6D grasps; and a refine-and-rank step over geometrically stable candidates to ensure executability. Key contributions include a memory construction paradigm independent of task-specific labels, a robust semantic 3D alignment strategy guided by dense features, and a grasp transfer mechanism that preserves task intent while honoring geometry, yielding strong generalization on TaskGrasp with 0.67 mAP and notable real-world success. The results demonstrate that leveraging generative models and cross-domain exemplars can achieve state-of-the-art performance in TOG without extensive annotated datasets, offering a scalable, data-efficient path toward adaptable robotic manipulation.
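The retrieval step described above, joint visual and task similarity over a memory of exemplars, can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration: the `MemoryEntry` fields, the `retrieve` function, the equal weighting via `visual_weight`, and the tiny hand-written vectors that stand in for pooled DINO descriptors and CLIP task embeddings. This summary does not state how GRIM actually combines the two scores.

```python
import math
from dataclasses import dataclass
from typing import Sequence


@dataclass
class MemoryEntry:
    """One stored exemplar; field names are illustrative, not from the paper."""
    object_name: str
    task: str
    visual_embedding: Sequence[float]  # stand-in for a pooled DINO descriptor
    task_embedding: Sequence[float]    # stand-in for a CLIP task embedding


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is zero)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def retrieve(memory, query_visual, query_task, visual_weight=0.5):
    """Return the entry with the highest weighted visual + task similarity.

    The linear weighting is an assumed combination rule, not the paper's.
    """
    def score(entry):
        visual = cosine(entry.visual_embedding, query_visual)
        task = cosine(entry.task_embedding, query_task)
        return visual_weight * visual + (1.0 - visual_weight) * task

    return max(memory, key=score)


# Toy memory with made-up embeddings.
memory = [
    MemoryEntry("cup", "drink", [1.0, 0.0, 0.0], [0.0, 1.0]),
    MemoryEntry("hammer", "hammer", [0.0, 1.0, 0.0], [1.0, 0.0]),
    MemoryEntry("cup", "handover", [1.0, 0.0, 0.0], [1.0, 0.0]),
]

# Query: an object that looks like a cup, with a task close to "drink".
best = retrieve(memory, query_visual=[0.9, 0.1, 0.0], query_task=[0.1, 0.9])
print(best.object_name, best.task)
```

With these toy vectors, the cup/drink entry wins because it scores highly on both terms, whereas the cup/handover entry matches only visually. In a real system the embeddings would come from the respective encoders rather than being written by hand.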

Abstract

Task-Oriented Grasping (TOG) requires robots to select grasps that are functionally appropriate for a specified task, a challenge that demands an understanding of task semantics, object affordances, and functional constraints. We present GRIM (Grasp Re-alignment via Iterative Matching), a training-free framework that addresses these challenges by leveraging Video Generation Models (VGMs) together with a retrieve-align-transfer pipeline. Beyond leveraging VGMs, GRIM can construct a memory of object-task exemplars sourced from web images, human demonstrations, or generative models. The retrieved task-oriented grasp is then transferred and refined by evaluating it against a set of geometrically stable candidate grasps to ensure both functional suitability and physical feasibility. GRIM demonstrates strong generalization and achieves state-of-the-art performance on standard TOG benchmarks. Project website: https://grim-tog.github.io

Paper Structure

This paper contains 18 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The GRIM framework for task-oriented grasp synthesis. From a single scene image, the VGM generates task-specific video examples, such as hammering (Task A) and handover (Task B). Grasps are extracted from these generated videos and then transferred to a robotic arm to execute the specified task in the real world as shown for hammering (Task A).
  • Figure 2: Our memory creation pipeline. A diverse set of inputs (AI-generated video frames, web images, human demonstrations) is processed by a hand-object reconstruction module [wu2024reconstructing]. This yields an object mesh and a corresponding task-oriented grasp pose. We enrich the object mesh with dense DINO features to create a feature mesh, which is stored in memory alongside the task label and grasp pose.
  • Figure 3: The GRIM pipeline for a given scene object and task. (1) Retrieval: The system queries its memory using joint visual and task similarity to find the best matching prior experience (a cup for the task 'drink'). (2) Alignment: The retrieved memory object (red point cloud) is aligned with the scene object (grey point cloud) using our feature-guided iterative alignment. The colors on the objects represent PCA-reduced DINO features, showing semantic correspondence. (3) Transfer & Refine: The grasp from the memory object is transferred to the scene object and used to select the best among a set of task-agnostic, stable grasp candidates (cluster of purple grasps), resulting in the final task-oriented grasp (single purple grasp).
  • Figure 4: Real-world deployment of GRIM with novel objects. The system correctly plans task-oriented grasps, and the Kinova Gen3 Lite robot successfully executes them.
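The "Transfer & Refine" step in the Figure 3 caption can be sketched as follows: the memory grasp is mapped into the scene by the alignment transform, and the stable candidate closest to that transferred pose is kept. This is a minimal sketch under stated assumptions. Poses are taken to be 4x4 homogeneous matrices, and the nearest-pose criterion, the `rotation_weight` parameter, and all function names are hypothetical; the actual ranking score used by GRIM is not given in this excerpt.

```python
import math


def matmul4(a, b):
    """Multiply two 4x4 matrices given as nested lists."""
    return [[sum(a[i][k] * b[k][j] for k in range(4)) for j in range(4)]
            for i in range(4)]


def pose_distance(p, q, rotation_weight=0.1):
    """Translation distance plus weighted rotation angle (radians).

    rotation_weight is an assumed trade-off, not a value from the paper.
    """
    translation = math.dist([p[i][3] for i in range(3)],
                            [q[i][3] for i in range(3)])
    # trace(R_p^T R_q) equals the elementwise product sum of the 3x3 blocks.
    trace = sum(p[i][j] * q[i][j] for i in range(3) for j in range(3))
    angle = math.acos(max(-1.0, min(1.0, (trace - 1.0) / 2.0)))
    return translation + rotation_weight * angle


def transfer_and_select(alignment, memory_grasp, candidates):
    """Map the memory grasp into the scene, then pick the nearest candidate."""
    transferred = matmul4(alignment, memory_grasp)
    return min(candidates, key=lambda c: pose_distance(c, transferred))


def translation_pose(x, y, z):
    """Helper: a pose with identity rotation and the given translation."""
    return [[1.0, 0.0, 0.0, x],
            [0.0, 1.0, 0.0, y],
            [0.0, 0.0, 1.0, z],
            [0.0, 0.0, 0.0, 1.0]]


# Toy example: the alignment shifts the memory object 10 cm along x.
alignment = translation_pose(0.10, 0.0, 0.0)
memory_grasp = translation_pose(0.0, 0.05, 0.0)
candidates = [
    translation_pose(0.0, 0.0, 0.0),
    translation_pose(0.11, 0.05, 0.0),
    translation_pose(0.30, 0.30, 0.0),
]
chosen = transfer_and_select(alignment, memory_grasp, candidates)
print(chosen[0][3], chosen[1][3], chosen[2][3])
```

In this toy case the transferred grasp lands at (0.10, 0.05, 0.0), so the second candidate, 1 cm away, is selected. Choosing from a pre-computed stable set rather than executing the transferred pose directly reflects the caption's point that the final grasp should be both task-appropriate and physically feasible.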