Lan-grasp: Using Large Language Models for Semantic Object Grasping and Placement
Reihaneh Mirjalili, Michael Krawez, Yannik Blei, Simone Silenzi, Florian Walter, Wolfram Burgard
TL;DR
Lan-grasp tackles semantic grasping and safe object placement in everyday environments by leveraging foundation models in a zero-shot setting. It combines a Large Language Model (LLM) for identifying grasp-relevant object parts, a Vision-Language Model (VLM) for grounding, and a conventional grasp planner to generate functionally meaningful grasps, with a Visual Chain-of-Thought feedback loop to ensure feasibility. The paper also proposes an upright object placement method that uses SAM 3D and VLM reasoning to determine the correct orientation, including a two-cycle alignment to compensate for pose errors. Real-world experiments on 22 objects and a human-subject survey show that Lan-grasp yields grasps more in line with human preferences than those of the baselines and achieves high placement success, though challenges remain for translucent objects and precise pose estimation.
Abstract
In this paper, we propose Lan-grasp, a novel approach to more appropriate semantic grasping and placing. We leverage foundation models to equip the robot with a semantic understanding of object geometry, enabling it to identify the right place to grasp, which parts to avoid, and the natural pose for placement. This is an important contribution to grasping and utilizing objects in a more meaningful and safe manner. We combine a Large Language Model (LLM), a Vision-Language Model (VLM), and a traditional grasp planner to generate grasps that demonstrate a deeper semantic understanding of the objects. Building on foundation models provides us with a zero-shot grasp method that can handle a wide range of objects without requiring further training or fine-tuning. We also propose a method for safely putting down a grasped object. The core idea is to rotate the object upright using a pretrained generative model and the reasoning capabilities of a VLM. We evaluate our method in real-world experiments on a custom object dataset and present the results of a survey that asks participants to choose an object part appropriate for grasping. The results show that the grasps generated by our method are consistently ranked higher by the participants than those generated by a conventional grasp planner and a recent semantic grasping approach. In addition, we propose a Visual Chain-of-Thought feedback loop to assess grasp feasibility in complex scenarios. This mechanism enables dynamic reasoning and generates alternative grasp strategies when needed, ensuring safer and more effective grasping outcomes.
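To make the control flow of the pipeline described in the abstract concrete, the following is a minimal Python sketch, not the authors' implementation. All names here (`plan_semantic_grasp`, `Grasp`, and the four injected callables) are hypothetical and introduced only for illustration. The LLM, the VLM grounding step, the grasp planner, and the feasibility check are passed in as plain functions, so the sketch makes no assumption about which models or libraries are used. The feasibility callable merely stands in for the Visual Chain-of-Thought feedback loop, whose actual prompts and reasoning steps are not specified here.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence, Tuple

# Hypothetical type: a bounding box (x_min, y_min, x_max, y_max) in image pixels.
BBox = Tuple[int, int, int, int]


@dataclass
class Grasp:
    """Hypothetical grasp candidate: a 3D position and a planner score."""
    position: Tuple[float, float, float]
    score: float


def plan_semantic_grasp(
    object_name: str,
    ask_llm_for_part: Callable[[str, Sequence[str]], Optional[str]],
    ground_part: Callable[[str, str], Optional[BBox]],
    plan_grasps_in_region: Callable[[BBox], Sequence[Grasp]],
    is_feasible: Callable[[Grasp, str], bool],
    max_attempts: int = 3,
) -> Optional[Tuple[str, Grasp]]:
    """Return (part, grasp) for the first feasible grasp, or None.

    Assumed flow: the LLM names a part to grasp, the VLM grounds it to a
    region, the planner proposes grasps in that region, and a feasibility
    check (standing in for the feedback loop) accepts or rejects them.
    Rejected parts are reported back to the LLM so that it can propose
    an alternative.
    """
    rejected: List[str] = []
    for _ in range(max_attempts):
        part = ask_llm_for_part(object_name, rejected)
        if part is None:
            return None  # the LLM has no further suggestion
        box = ground_part(object_name, part)
        if box is not None:
            # Try the planner's candidates from highest to lowest score.
            candidates = sorted(
                plan_grasps_in_region(box), key=lambda g: g.score, reverse=True
            )
            for grasp in candidates:
                if is_feasible(grasp, part):
                    return part, grasp
        rejected.append(part)
    return None
```

Injecting the components as callables keeps the sketch testable with simple stubs and leaves the choice of language model, grounding model, and grasp planner open.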
