Lan-grasp: Using Large Language Models for Semantic Object Grasping and Placement

Reihaneh Mirjalili, Michael Krawez, Yannik Blei, Simone Silenzi, Florian Walter, Wolfram Burgard

TL;DR

Lan-grasp tackles semantic grasping and safe object placement in everyday environments by using foundation models in a zero-shot setting. It combines a Large Language Model (LLM) that identifies grasp-relevant object parts, a Vision-Language Model (VLM) that grounds those parts in the image, and a conventional grasp planner to generate functionally meaningful grasps; a Visual Chain-of-Thought feedback loop assesses grasp feasibility. The paper also proposes an upright object placement method that uses SAM 3D and VLM reasoning to determine the correct orientation, with a two-cycle alignment to compensate for pose errors. Real-world experiments on 22 objects and a human-subject survey indicate that Lan-grasp yields grasps more in line with human preferences than the baselines and achieves high placement success, though challenges remain for translucent objects and precise pose estimation.

Abstract

In this paper, we propose Lan-grasp, a novel approach towards more appropriate semantic grasping and placing. We leverage foundation models to equip the robot with a semantic understanding of object geometry, enabling it to identify the right place to grasp, which parts to avoid, and the natural pose for placement. This is an important contribution to grasping and utilizing objects in a more meaningful and safe manner. We leverage a combination of a Large Language Model, a Vision-Language Model, and a traditional grasp planner to generate grasps that demonstrate a deeper semantic understanding of the objects. Building on foundation models provides us with a zero-shot grasp method that can handle a wide range of objects without requiring further training or fine-tuning. We also propose a method for safely putting down a grasped object. The core idea is to rotate the object upright utilizing a pretrained generative model and the reasoning capabilities of a VLM. We evaluate our method in real-world experiments on a custom object dataset and present the results of a survey that asks participants to choose an object part appropriate for grasping. The results show that the grasps generated by our method are consistently ranked higher by the participants than those generated by a conventional grasping planner and a recent semantic grasping approach. In addition, we propose a Visual Chain-of-Thought feedback loop to assess grasp feasibility in complex scenarios. This mechanism enables dynamic reasoning and generates alternative grasp strategies when needed, ensuring safer and more effective grasping outcomes.

Paper Structure

This paper contains 18 sections, 1 equation, 12 figures, and 4 tables.

Figures (12)

  • Figure 1: Robot performing the command of "Pick up the ice cream please". The grasp on the left is generated without including semantic information, while the grasp on the right is performed using our method, which leverages a deeper understanding of the task and the object provided by Large Language Models.
  • Figure 2: Our grasping approach in a nutshell: The command from the user is turned into a prompt suitable for the Large Language Model (LLM). With this prompt as an input, the LLM outputs the proper part for grasping the object, which in this example is the cone. This word is then grounded to the object image using a Vision-Language Model (VLM). The grounded grasp part is integrated into the 3D reconstruction model of the object to generate the proper grasp.
  • Figure 3: Summary of our placing approach showing one object alignment cycle. Given an RGB image of the grasped object, we first segment the object and then use SAM 3D to obtain the object 3D reconstruction and pose estimation. We then render the reconstructed mesh from six views, each showing one axis in the object frame up or down. A vision–language model is prompted to select the orientation that best matches placing the object down correctly. Finally, the robot executes the corresponding roll, pitch, and yaw rotation to align the grasp to the selected object pose.
  • Figure 4: The grasps performed by the HSR robot: Each column presents the grasps for one object. The first two rows for each object show the grasps generated without semantic knowledge about the objects, while the third and fourth rows show the grasps generated by Lan-grasp.
  • Figure 5: The grasps performed by the HSR robot: Each column presents the grasps for one object. The first two rows for each object show the grasps generated without semantic knowledge about the objects, while the third and fourth rows show the grasps generated by Lan-grasp.
  • ...and 7 more figures
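
The last step of the placing approach in the Figure 3 caption (turning the selected view into a roll, pitch, and yaw rotation) can be illustrated with a small sketch. This is not the authors' implementation. It assumes the six views correspond to the object-frame axes ±x, ±y, ±z, that the pose estimate is available as a 3x3 object-to-world rotation matrix, that world "up" is +z, and that angles follow the ZYX Euler convention. All names (`CANDIDATE_AXES`, `upright_correction`, `roll_pitch_yaw`) are hypothetical.

```python
import math

# Assumed mapping: one candidate "up" axis (object frame) per rendered view.
CANDIDATE_AXES = {
    "+x": (1.0, 0.0, 0.0), "-x": (-1.0, 0.0, 0.0),
    "+y": (0.0, 1.0, 0.0), "-y": (0.0, -1.0, 0.0),
    "+z": (0.0, 0.0, 1.0), "-z": (0.0, 0.0, -1.0),
}
WORLD_UP = (0.0, 0.0, 1.0)  # assumption: world z-axis points up


def mat_vec(R, v):
    """Multiply a 3x3 matrix (nested lists) by a 3-vector."""
    return tuple(sum(R[i][j] * v[j] for j in range(3)) for i in range(3))


def cross(a, b):
    """Cross product of two 3-vectors."""
    return (a[1] * b[2] - a[2] * b[1],
            a[2] * b[0] - a[0] * b[2],
            a[0] * b[1] - a[1] * b[0])


def rotation_between(a, b):
    """Rotation matrix taking unit vector a onto unit vector b (Rodrigues' formula)."""
    axis = cross(a, b)
    c = sum(x * y for x, y in zip(a, b))      # cos(theta)
    s = math.sqrt(sum(x * x for x in axis))   # sin(theta)
    if s < 1e-9:
        if c > 0:  # already aligned
            return [[1.0 if i == j else 0.0 for j in range(3)] for i in range(3)]
        # Antiparallel: rotate 180 degrees about any axis perpendicular to a.
        helper = (1.0, 0.0, 0.0) if abs(a[0]) < 0.9 else (0.0, 1.0, 0.0)
        axis = cross(a, helper)
        norm = math.sqrt(sum(x * x for x in axis))
        k, s, c = tuple(x / norm for x in axis), 0.0, -1.0
    else:
        k = tuple(x / s for x in axis)
    K = [[0.0, -k[2], k[1]], [k[2], 0.0, -k[0]], [-k[1], k[0], 0.0]]
    # R = cos*I + sin*K + (1 - cos) * k k^T
    return [[c * (i == j) + s * K[i][j] + (1.0 - c) * k[i] * k[j]
             for j in range(3)] for i in range(3)]


def roll_pitch_yaw(R):
    """ZYX Euler angles in radians, with R = Rz(yaw) @ Ry(pitch) @ Rx(roll)."""
    sp = max(-1.0, min(1.0, -R[2][0]))
    pitch = math.asin(sp)
    if abs(sp) > 1.0 - 1e-9:
        # Gimbal lock: roll and yaw are coupled, so yaw is fixed to zero.
        sign = 1.0 if sp > 0 else -1.0
        return math.atan2(sign * R[0][1], R[1][1]), pitch, 0.0
    return math.atan2(R[2][1], R[2][2]), pitch, math.atan2(R[1][0], R[0][0])


def upright_correction(R_world_obj, selected_view):
    """World-frame rotation that turns the selected object axis to world up."""
    axis_world = mat_vec(R_world_obj, CANDIDATE_AXES[selected_view])
    return rotation_between(axis_world, WORLD_UP)


# Usage: the object is held with its frame aligned to the world frame,
# and the view with the object's +x axis pointing up was selected.
identity = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
R_fix = upright_correction(identity, "+x")
roll, pitch, yaw = roll_pitch_yaw(R_fix)
```

The sketch picks the smallest rotation that brings the chosen axis to vertical, which leaves the rotation about the vertical axis unconstrained. A real system would also have to account for gripper kinematics and for errors in the estimated pose, which the paper addresses with a second alignment cycle.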