Click to Grasp: Zero-Shot Precise Manipulation via Visual Diffusion Descriptors
Nikolaos Tsagkas, Jack Rome, Subramanian Ramamoorthy, Oisin Mac Aodha, Chris Xiaoxuan Lu
TL;DR
Click to Grasp (C2G) addresses zero-shot precise manipulation across object instances with visually ambiguous parts by grounding diffusion-derived descriptors in a dense semantic correspondence framework. The method builds a multi-view implicit descriptor field $\mathcal{F}$ augmented with diffusion and DINO features, uses it to ground a single click on a source image into a 3D area of interaction $\mathcal{A}$, and then optimizes a gripper pose $\mathbf{T}\in SE(3)$ via a differentiable, geometry-based loss. Key contributions are the modular perception-to-manipulation pipeline, 3D area grounding from cross-instance descriptors, and a collision-aware pose optimization that requires no manipulation demonstrations. Real-world experiments on stuffed toys and shoes report about 92% grasping success. The work shows that diffusion-derived descriptors can be combined with classical robotics pipelines for semantic-aware manipulation in practical tabletop settings, enabling precise part-level interactions with minimal user input.
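As a rough illustration of the click-grounding step summarized above, the following sketch matches the descriptor under a clicked source pixel against per-point descriptors of a target object and keeps the most similar 3D points as a stand-in for the area of interaction $\mathcal{A}$. It is not the authors' implementation: the function name `ground_click`, the array layouts, the use of cosine similarity, and the top-k selection rule are all assumptions made for illustration, and the descriptors are taken as given rather than extracted from a diffusion or DINO model.

```python
import numpy as np


def ground_click(src_desc, click_uv, tgt_points, tgt_desc, top_k=50):
    """Hypothetical sketch: ground a source-image click into a set of 3D points.

    src_desc:   (H, W, D) per-pixel descriptors of the source image (assumed given).
    click_uv:   (u, v) pixel coordinates of the click, u = column, v = row.
    tgt_points: (N, 3) 3D points sampled on the target object.
    tgt_desc:   (N, D) descriptors queried at those points (assumed given).
    Returns the top_k most similar points and their cosine similarities,
    sorted from most to least similar.
    """
    u, v = click_uv
    # Descriptor under the click, normalized to unit length.
    query = src_desc[v, u]
    query = query / (np.linalg.norm(query) + 1e-8)

    # Normalize target descriptors row-wise so the dot product is a cosine.
    norms = np.linalg.norm(tgt_desc, axis=1, keepdims=True)
    tgt_unit = tgt_desc / (norms + 1e-8)
    sim = tgt_unit @ query

    # Keep the k best matches (assumed selection rule, not from the paper).
    k = min(top_k, sim.shape[0])
    idx = np.argsort(-sim)[:k]
    return tgt_points[idx], sim[idx]
```

In this simplified form the returned points could, for example, be averaged to obtain a candidate grasp center; the pose optimization over $\mathbf{T}\in SE(3)$ described above is a separate step that this sketch does not attempt.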
Abstract
Precise manipulation that generalizes across scenes and objects remains a persistent challenge in robotics. Current approaches to this task depend heavily on a large number of training instances to handle objects with pronounced visual and/or geometric part ambiguities. Our work explores the grounding of fine-grained part descriptors for precise manipulation in a zero-shot setting by utilizing web-trained text-to-image diffusion-based generative models. We tackle the problem by framing it as a dense semantic part correspondence task. Our model returns a gripper pose for manipulating a specific part, using as reference a user-defined click on a source image of a visually different instance of the same object. We require no manual grasping demonstrations, as we leverage the intrinsic object geometry and features. Practical experiments in a real-world tabletop scenario validate the efficacy of our approach, demonstrating its potential for advancing semantic-aware robotic manipulation. Web page: https://tsagkas.github.io/click2grasp
