VLAD-Grasp: Zero-shot Grasp Detection via Vision-Language Models

Manav Kulshrestha, S. Talha Bukhari, Damon Conover, Aniket Bera

TL;DR

VLAD-Grasp introduces a zero-shot grasping framework that leverages vision-language models (VLMs) to reason about object geometry and generate antipodal grasp candidates without task-specific training. From a single RGB-D image, it prompts a VLM to produce a goal image where a rod encodes the grasp axis, lifts this representation into 3D via monocular depth and segmentation, and aligns generated and observed object point clouds with PCA-based registration and a correspondence-free optimization to recover a 6-DoF grasp pose. The approach achieves competitive or superior results on Cornell and Jacquard in zero-shot settings and demonstrates real-world deployment on a Franka Research 3 robot, illustrating that foundation priors can support robust manipulation of novel objects without labeled grasp data. While promising, the method faces limitations from generation reliability, depth/segmentation accuracy, and computational cost, which motivate further work on efficiency and reliability of VLM-conditioned robotics pipelines.
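The lifting step described above can be illustrated with a standard pinhole back-projection. This is a minimal sketch, not the authors' implementation: the function name `backproject`, the intrinsics `fx, fy, cx, cy`, and the nested-list image layout are all assumptions made for illustration, and the summary does not say how depth is scaled or which camera model is used.

```python
from typing import List, Sequence, Tuple

Point3D = Tuple[float, float, float]


def backproject(
    depth: Sequence[Sequence[float]],
    mask: Sequence[Sequence[int]],
    fx: float,
    fy: float,
    cx: float,
    cy: float,
) -> List[Point3D]:
    """Lift masked depth pixels to 3D camera-frame points (pinhole model).

    `depth[v][u]` is the depth in metres at image row v and column u,
    and `mask[v][u]` is truthy for pixels that belong to the object.
    The intrinsics are hypothetical inputs, not values from the paper.
    """
    points: List[Point3D] = []
    for v, (depth_row, mask_row) in enumerate(zip(depth, mask)):
        for u, (z, keep) in enumerate(zip(depth_row, mask_row)):
            # Skip background pixels and invalid (non-positive) depth.
            if not keep or z <= 0:
                continue
            x = (u - cx) * z / fx
            y = (v - cy) * z / fy
            points.append((x, y, z))
    return points
```

Applying the same function to the observed depth map and to the depth predicted for the generated image would yield the two object point clouds that the alignment stage compares.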

Abstract

Robotic grasping is a fundamental capability for autonomous manipulation; however, most existing methods rely on large-scale expert annotations and necessitate retraining to handle new objects. We present VLAD-Grasp, a Vision-Language model Assisted zero-shot approach for Detecting grasps. From a single RGB-D image, our method (1) prompts a large vision-language model to generate a goal image where a straight rod "impales" the object, representing an antipodal grasp, (2) predicts depth and segmentation to lift this generated image into 3D, and (3) aligns generated and observed object point clouds via principal component analysis and correspondence-free optimization to recover an executable grasp pose. Unlike prior work, our approach is training-free and does not rely on curated grasp datasets. Despite this, VLAD-Grasp achieves performance that is competitive with or superior to that of state-of-the-art supervised models on the Cornell and Jacquard datasets. We further demonstrate zero-shot generalization to novel real-world objects on a Franka Research 3 robot, highlighting vision-language foundation models as powerful priors for robotic manipulation.
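Step (3) begins with a PCA-based alignment of the two point clouds. The sketch below shows one common way to build such an initial alignment with NumPy, under the assumption that each cloud is an `(N, 3)` array; the names `pca_frame` and `pca_align` are hypothetical. Principal axes are only defined up to sign, so this estimate can be off by 180-degree flips, which is one reason a refinement stage such as the correspondence-free optimization mentioned above is still needed.

```python
import numpy as np


def pca_frame(points: np.ndarray):
    """Return the centroid and a right-handed basis of principal axes.

    Columns of the returned matrix are unit axes ordered by decreasing
    variance. The sign of each axis is arbitrary.
    """
    centroid = points.mean(axis=0)
    centered = points - centroid
    cov = centered.T @ centered / len(points)
    # eigh returns eigenvalues in ascending order for symmetric matrices.
    _, eigvecs = np.linalg.eigh(cov)
    axes = eigvecs[:, ::-1].copy()
    # Flip the last axis if needed so the basis is a proper rotation.
    if np.linalg.det(axes) < 0:
        axes[:, -1] *= -1.0
    return centroid, axes


def pca_align(source: np.ndarray, target: np.ndarray):
    """Estimate a rigid transform (R, t) mapping source onto target.

    Matches centroids and principal axes only; no point correspondences
    are used, and axis-sign ambiguity is left unresolved.
    """
    c_src, a_src = pca_frame(source)
    c_tgt, a_tgt = pca_frame(target)
    rotation = a_tgt @ a_src.T
    translation = c_tgt - rotation @ c_src
    return rotation, translation
```

A source cloud is then moved with `source @ rotation.T + translation`. Scale differences between a generated and an observed cloud are not handled in this sketch.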

Paper Structure

This paper contains 16 sections, 3 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: VLAD captures an image of the target object and queries a VLM using the cropped view alongside sequential guiding prompts to reason about the object’s geometry and feasible grasps. The VLM then generates a goal image depicting a virtual rod “impaling” the object, which encodes the antipodal grasp axis. This axis is reconstructed in 3D and aligned with the observed scene to yield an executable grasp pose.
  • Figure 2: Overview of our approach. We capture an RGB-D image $(I_S, D_S)$ of the object and mask out background distractors. The RGB image $I_S$ is provided to the VLM together with structured guiding prompts $T^g_i$, which elicit reasoning $R^g$ about object geometry and eventually a generated image $I_G$, where the goal grasp is indicated by a rod passing through the antipodal grasp points on the object's surface. Following this, we predict a point cloud $P^o_G$ for the object in the generated image $I_G$ and match it with the point cloud $P^o_S$ for the object in the original image $I_S$.
  • Figure 3: Qualitative comparison of grasp detection on different objects. For each method, we show five grasps per object, although for some methods the grasp modalities may sometimes converge. Compared to baselines, our method detects more successful grasps across diverse object types. Prior methods often produce coarse or misaligned grasps, while our approach generates accurate and well-localized grasps that align with object geometry.
  • Figure 4: Examples of viable grasps generated by our approach (yellow), improperly marked as failures due to high angular mismatch (left) and missing ground truth annotations near the predicted grasp (right).
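The angular-mismatch failures noted for Figure 4 can be made concrete with a small orientation check. This is a sketch under stated assumptions: the function names are hypothetical, and the 30-degree tolerance is borrowed from the rectangle metric commonly used with Cornell and Jacquard rather than taken from this summary. Angles are compared modulo 180 degrees because a parallel-jaw grasp rotated by half a turn is the same grasp.

```python
from typing import Iterable


def grasp_angle_error_deg(pred_deg: float, gt_deg: float) -> float:
    """Smallest in-plane angle between two grasp orientations, in degrees.

    Orientations are treated as equivalent under 180-degree rotation,
    so the result always lies in [0, 90].
    """
    diff = abs(pred_deg - gt_deg) % 180.0
    return min(diff, 180.0 - diff)


def passes_angle_check(
    pred_deg: float,
    gt_angles_deg: Iterable[float],
    tol_deg: float = 30.0,  # assumed tolerance, not stated in the source
) -> bool:
    """True if the prediction is within tol_deg of any annotated grasp."""
    return any(
        grasp_angle_error_deg(pred_deg, gt) <= tol_deg for gt in gt_angles_deg
    )
```

Under a check of this kind, a physically viable grasp is still counted as a failure whenever no annotated grasp happens to share its orientation, which matches the left-hand case the caption describes.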