Training-free Task-oriented Grasp Generation
Jiaming Wang, Diwen Liu, Jizhuo Chen, Harold Soh
TL;DR
This work introduces TOG, a training-free pipeline that combines pre-trained grasp-generation models with vision-language models (VLMs) to produce task-oriented grasps. It formalizes task-conditioned grasp generation as $P(G^* \mid X, T)$ and uses a two-stage strategy: sample diverse candidate grasps from a pre-trained model and filter them for feasibility, then use a VLM to select the grasp that best satisfies the task. In simulation and real-world experiments, TOG improves both task compliance and overall success. The best-performing strategy depends on VLM strength: either a powerful VLM freely indicates grasp points (CPG), or a constrained, diverse set of grasps is filtered for VLM evaluation. The study highlights the potential of pairing foundation-model reasoning with traditional grasp generation to adapt flexibly to varied tasks, and it identifies challenges in grasp diversity, query ambiguity, and spatial reasoning that point to future improvements such as view-based ambiguity reduction. The approach offers a scalable, training-free path to better task-oriented manipulation, enabling faster adaptation to new objects and tasks through collaboration between pre-trained models.
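The two-stage strategy summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: the `Grasp` record, `select_task_grasp`, and `toy_vlm_score` are hypothetical names, and the VLM is reduced to a plain scoring function, whereas the paper queries a VLM with visual representations of the candidate grasps.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass(frozen=True)
class Grasp:
    """Hypothetical grasp record (illustrative fields, not from the paper)."""
    position: Tuple[float, float, float]  # grasp center in the object frame
    stability: float                      # score from the grasp generator


def select_task_grasp(
    candidates: Sequence[Grasp],
    is_feasible: Callable[[Grasp], bool],
    vlm_score: Callable[[Grasp, str], float],
    task: str,
) -> Grasp:
    """Filter candidates for feasibility, then return the one the scorer rates highest."""
    feasible = [g for g in candidates if is_feasible(g)]
    if not feasible:
        raise ValueError("no feasible grasp candidates")
    return max(feasible, key=lambda g: vlm_score(g, task))


def toy_vlm_score(grasp: Grasp, task: str) -> float:
    """Stand-in for a VLM query: prefers the handle for 'pour', the rim otherwise."""
    target = (0.10, 0.0, 0.05) if task == "pour" else (0.0, 0.0, 0.10)
    # Negative squared distance, so a closer grasp gets a higher score.
    return -sum((a - b) ** 2 for a, b in zip(grasp.position, target))


# Assumed candidates on a mug; in the pipeline these would come from a
# pre-trained grasp generator.
candidates = [
    Grasp((0.10, 0.0, 0.05), 0.8),  # handle
    Grasp((0.0, 0.0, 0.10), 0.9),   # rim
    Grasp((0.0, 0.0, 0.0), 0.2),    # base, rejected by the feasibility filter
]

best = select_task_grasp(
    candidates,
    is_feasible=lambda g: g.stability >= 0.5,  # assumed threshold
    vlm_score=toy_vlm_score,
    task="pour",
)
print(best.position)
```

Passing the feasibility check and the scorer as callables keeps the selection logic independent of any particular grasp generator or VLM, which mirrors the training-free idea of composing pre-trained components.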
Abstract
This paper presents a training-free pipeline for task-oriented grasp generation that combines pre-trained grasp generation models with vision-language models (VLMs). Unlike traditional approaches that focus solely on stable grasps, our method incorporates task-specific requirements by leveraging the semantic reasoning capabilities of VLMs. We evaluate five querying strategies, each utilizing different visual representations of candidate grasps, and demonstrate significant improvements over a baseline method in both grasp success and task compliance rates, with absolute gains of up to 36.9% in overall success rate. Our results underline the potential of VLMs to enhance task-oriented manipulation, providing insights for future research in robotic grasping and human-robot interaction.
