Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors

Nikolaos Tsagkas, Jack Rome, Subramanian Ramamoorthy, Oisin Mac Aodha, Chris Xiaoxuan Lu

TL;DR

Click to Grasp (C2G) addresses zero-shot precise manipulation across object instances with visually ambiguous parts by grounding diffusion-derived descriptors into a dense semantic correspondence framework. The method builds a multi-view implicit descriptor field $\mathcal{F}$ augmented with diffusion and DINO features to ground a single source-image click into a 3D area of interaction $\mathcal{A}$ and then optimizes a gripper pose $\mathbf{T}\in SE(3)$ via a differentiable, geometry-based loss. Key contributions include the modular perception-to-manipulation pipeline, 3D area grounding from cross-instance descriptors, and a collision-aware pose optimization that requires no manipulation demonstrations; real-world experiments on stuffed toys and shoes report about 92% grasping success. This work demonstrates the viability of combining diffusion-derived descriptors with classical robotics pipelines for semantic-aware manipulation in practical tabletop settings, enabling precise part-level interactions with minimal user input.

Abstract

Precise manipulation that is generalizable across scenes and objects remains a persistent challenge in robotics. Current approaches for this task heavily depend on having a significant number of training instances to handle objects with pronounced visual and/or geometric part ambiguities. Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting by utilizing web-trained text-to-image diffusion-based generative models. We tackle the problem by framing it as a dense semantic part correspondence task. Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click from a source image of a visually different instance of the same object. We require no manual grasping demonstrations as we leverage the intrinsic object geometry and features. Practical experiments in a real-world tabletop scenario validate the efficacy of our approach, demonstrating its potential for advancing semantic-aware robotics manipulation. Web page: https://tsagkas.github.io/click2grasp
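The dense-correspondence framing above can be illustrated with a minimal sketch: take the descriptor under the clicked source pixel and score every target location by cosine similarity. This is an illustration under stated assumptions, not the authors' implementation. The feature maps are random stand-ins rather than DINO or SD features, and the function name `click_similarity` is hypothetical.

```python
import numpy as np

def click_similarity(src_feats, tgt_feats, click_uv):
    """Cosine similarity between the clicked source descriptor and all target descriptors.

    src_feats, tgt_feats: (H, W, C) per-pixel feature maps (assumed to be given).
    click_uv: (u, v) pixel coordinates of the click, with u = column and v = row.
    Returns an (H, W) heatmap with values in [-1, 1].
    """
    u, v = click_uv
    query = src_feats[v, u]
    query = query / (np.linalg.norm(query) + 1e-8)
    norms = np.linalg.norm(tgt_feats, axis=-1, keepdims=True) + 1e-8
    return (tgt_feats / norms) @ query

# Toy data (assumption): random features with one planted match.
rng = np.random.default_rng(0)
src = rng.normal(size=(8, 8, 16))
tgt = rng.normal(size=(8, 8, 16))
tgt[5, 2] = src[3, 4]  # the target pixel at row 5, col 2 matches the clicked source pixel

heat = click_similarity(src, tgt, click_uv=(4, 3))
best_v, best_u = np.unravel_index(np.argmax(heat), heat.shape)
```

In this toy setup the heatmap peaks at the planted pixel. With real features, a single similarity map of this kind can respond to several visually similar parts, which is the ambiguity the abstract refers to.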

Paper Structure (16 sections, 7 equations, 4 figures, 2 tables)

Figures (4)

  • Figure 1: Zero-shot localization heatmaps for identifying the left arm of the target stuffed toy from a single click demonstration from the top left source image. CLIP leads to higher activations in irrelevant areas when querying with a natural language prompt, i.e., "left arm". DINO achieves high similarity scores for both arm instances, without a clear preference for the left arm. SD features lead to high activations on the left side, but are not localized on the arm. Our C2G approach correctly identifies the left arm.
  • Figure 2: Perception modules $\mathcal{O}_{\mathcal{D}}$, $\mathcal{O}_{\mathcal{G}}$: (a) RGB-D images of the tabletop scene and a random source image of the object class are used as input, along with a single user-defined click, indicating the interaction area (Section \ref{ssec:problem_formulation}). (b) Images are lifted in 3D space by back-projection and interpolation of RGB values, densities, and features from DINO and SD (Section \ref{ssec:scene_representation}). (c) DINO and SD features from the source image are extracted and used to localize instances of the user-defined part, automatically identifying them as positive (same instance type) or negative (different instance type), and extracting visual descriptors (Section \ref{ssec:descriptors}). (d) DINO descriptors localize corresponding parts in the 3D scene, while SD descriptors disambiguate between instances, resulting in a 3D mask identifying the proposed interaction area (Section \ref{ssec:area_of_interaction}).
  • Figure 3: Manipulation module $\mathcal{O}_{\mathcal{T}}$: Given the proposed area of interaction $\mathcal{A}$, random gripper poses are initialised at each coordinate $\mathbf{x}\in\mathcal{A}$. Then, collision-free poses are retained and optimized, using Eq. \ref{eq:optimizer}. The gripper pose with the lowest final loss score is then sent to the motion planner.
  • Figure 4: Visualization of results for three stuffed-toy and three shoe manipulation experiments: (1) Source images along with the detected $(u,v)$ coordinates of the positive and negative part instances. (2) Reconstructed 3D scene. (3) Part similarity heatmap. (4) Part instance disambiguation. (5) Identifying the area of interaction versus the most similar part instance. (6) Optimized gripper pose. (7) Real-world object manipulation.
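
The control flow described in the Figure 3 caption (initialise candidates in the area of interaction, keep the collision-free ones, optimize, pick the lowest final loss) can be sketched as follows. This is a simplified illustration, not the paper's optimizer: it refines only a 3D position rather than a full pose in $SE(3)$, uses plain gradient descent, and the names `select_grasp`, `is_collision_free`, `loss_fn`, and `grad_fn` are hypothetical placeholders for the collision check and the loss that the caption refers to.

```python
import numpy as np

def select_grasp(area_points, is_collision_free, loss_fn, grad_fn, steps=50, lr=0.1):
    """Start one candidate per point, drop colliding ones, refine, return the best.

    Returns (best_position, best_loss), or (None, inf) if every candidate collides.
    """
    best_pos, best_loss = None, np.inf
    for x in area_points:
        pos = np.asarray(x, dtype=float).copy()
        if not is_collision_free(pos):
            continue  # discard candidates that start in collision
        for _ in range(steps):
            pos = pos - lr * grad_fn(pos)  # plain gradient descent step
        final = loss_fn(pos)
        if final < best_loss:
            best_pos, best_loss = pos, final
    return best_pos, best_loss

# Toy stand-ins (assumptions): quadratic loss around a target point,
# and a "collision" test that rejects points below the table plane z = 0.
target = np.array([0.0, 0.0, 0.2])
loss = lambda p: float(np.sum((p - target) ** 2))
grad = lambda p: 2.0 * (p - target)
above_table = lambda p: p[2] > 0.0

candidates = [[0.0, 0.0, -1.0], [1.0, 1.0, 1.0], [0.5, 0.5, 0.5]]
best_pos, best_loss = select_grasp(candidates, above_table, loss, grad)
none_pos, none_loss = select_grasp([[0.0, 0.0, -1.0]], above_table, loss, grad)
```

The first candidate is rejected by the toy collision test, the remaining two are refined, and the one with the lower final loss is returned. In the pipeline described by the caption, that selected pose is what would be handed to the motion planner.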