Task-Aware Robotic Grasping by evaluating Quality Diversity Solutions through Foundation Models

Aurel X. Appius, Emiland Garrabe, Francois Helenon, Mahdi Khoramshahi, Mohamed Chetouani, Stephane Doncieux

TL;DR

The paper tackles task-aware robotic grasping by fusing semantic segmentation and geometric reasoning through LLMs and Quality Diversity to produce zero-shot task-conditioned grasps. It introduces an open-vocabulary 3D segmentation pipeline, uses an LLM to identify graspable and task-relevant subparts, and leverages a QD grasp archive to score and select grasps via a task-compatibility function $C(g,\mathcal{T})$. On a subset of the YCB dataset with a Franka Panda robot, it reports a weighted IoU of $73.6\%$ for task-conditioned grasp regions across 65 task-object combinations and $88\%$ human preference for task-aware grasps in end-to-end validation, a preference that a binomial test finds statistically significant. The approach offers a training-free, scalable route to task-aligned grasping, with potential extensions to more complex geometries and richer LLM-grounding for improved robustness.
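As a rough illustration of how a task-compatibility score could rank the grasps in an archive, the sketch below combines the three terms named in the caption of Figure 1: contact on the suggested grasp subpart, distance from the task subpart, and grasp force. The `Grasp` fields, the weights, and the linear combination are assumptions made for this example; the paper's actual definition of $C(g,\mathcal{T})$ is not reproduced here.

```python
from dataclasses import dataclass
import math


@dataclass
class Grasp:
    # All fields are hypothetical, chosen only for this illustration.
    contact: tuple   # (x, y, z) contact point on the object, in meters
    subpart: str     # label of the subpart the gripper touches
    force: float     # normalized grasp force in [0, 1]


def compatibility(g, grasp_part, task_part_center,
                  w_part=1.0, w_dist=1.0, w_force=0.5):
    """Assumed linear score: reward the suggested subpart, distance
    from the task subpart, and a high grasp force."""
    on_part = 1.0 if g.subpart == grasp_part else 0.0
    dist = math.dist(g.contact, task_part_center)
    return w_part * on_part + w_dist * dist + w_force * g.force


def select_grasp(archive, grasp_part, task_part_center):
    """Pick the archive entry with the highest assumed score."""
    return max(archive,
               key=lambda g: compatibility(g, grasp_part, task_part_center))


# Toy archive for a knife-like object: one grasp on the handle, one on the blade.
archive = [
    Grasp((0.0, 0.0, 0.02), "handle", 0.8),
    Grasp((0.0, 0.0, 0.15), "blade", 0.9),
]
best = select_grasp(archive, grasp_part="handle",
                    task_part_center=(0.0, 0.0, 0.18))
print(best.subpart)
```

In this toy setup the handle grasp wins even though the blade grasp has a higher force, because the subpart and distance terms dominate under the assumed weights.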

Abstract

Task-aware robotic grasping is a challenging problem that requires the integration of semantic understanding and geometric reasoning. This paper proposes a novel framework that leverages Large Language Models (LLMs) and Quality Diversity (QD) algorithms to enable zero-shot task-conditioned grasp synthesis. The framework segments objects into meaningful subparts and labels each subpart semantically, creating structured representations that can be used to prompt an LLM. By coupling semantic and geometric representations of an object's structure, the LLM's knowledge about tasks and which parts to grasp can be applied in the physical world. The QD-generated grasp archive provides a diverse set of grasps, allowing us to select the most suitable grasp based on the task. We evaluated the proposed method on a subset of the YCB dataset with a Franka Emika robot. A consolidated ground truth for task-specific grasp regions is established through a survey. Our work achieves a weighted intersection over union (IoU) of 73.6% in predicting task-conditioned grasp regions in 65 task-object combinations. An end-to-end validation study on a smaller subset further confirms the effectiveness of our approach, with 88% of responses favoring the task-aware grasp over the control group. A binomial test shows that participants significantly prefer the task-aware grasp.
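The two evaluation quantities mentioned above can be sketched as follows. The abstract does not state how the IoU is weighted or how many survey responses were collected, so both are assumptions here: the weighting uses the generalized Jaccard index (sum of element-wise minima over sum of element-wise maxima) between a binary prediction and per-point survey weights, and the binomial test uses a hypothetical sample of 50 responses with 44 (88%) favoring the task-aware grasp.

```python
from math import comb


def weighted_iou(pred, weights):
    """Generalized Jaccard index between a binary prediction and
    per-point weights in [0, 1] (assumed form of the weighting)."""
    inter = sum(min(p, w) for p, w in zip(pred, weights))
    union = sum(max(p, w) for p, w in zip(pred, weights))
    return inter / union if union > 0 else 0.0


def binomial_p_value(successes, n, p0=0.5):
    """Exact one-sided p-value P(X >= successes) for X ~ Binomial(n, p0)."""
    return sum(comb(n, k) * p0**k * (1 - p0)**(n - k)
               for k in range(successes, n + 1))


# Toy data: four surface points, predicted mask vs. survey popularity.
pred = [1, 1, 0, 0]
weights = [1.0, 0.5, 0.5, 0.0]
iou = weighted_iou(pred, weights)

# Hypothetical counts: 44 of 50 responses prefer the task-aware grasp.
p_value = binomial_p_value(44, 50)
print(iou, p_value)
```

Under the null hypothesis that participants have no preference (`p0 = 0.5`), a small p-value indicates that the observed preference is unlikely to be due to chance.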

Paper Structure

This paper contains 22 sections, 10 equations, 6 figures, 1 table.

Figures (6)

  • Figure 1: The proposed workflow for our task-aware grasping framework. An object and a task are given to the model, and the object is then segmented into a labeled dictionary of subparts. The relevant subparts for grasping and task execution are determined by prompting a Large Language Model (LLM). Finally, Quality Diversity algorithms generate grasp candidates for the object. The ideal grasp for a given task is found by maximizing a score function that rewards grasping on the suggested grasp subpart and maintaining a distance from the task subpart, while encouraging a high grasp force.
  • Figure 2: The zero-shot semantic fine-grained segmentation is achieved by performing a Principal Component Analysis (PCA) and rendering the 3D object along its most variance-explaining axes. The 2D image is then segmented by Segment Anything [kirillov2023segment] and labeled using GPT-4o [openai2024gpt4technicalreport] as a Vision Language Model (VLM). A projection back to 3D is used to generate the labeled semantic 3D subparts.
  • Figure 3: Examples of labeled segmentation masks generated by the grasp segmentation pipeline using GPT-4o [openai2024gpt4technicalreport] as the VLM.
  • Figure 4: The Franka Emika robotic arm, equipped with a Panda 2-DoF gripper, used for conducting experimental evaluations of task-specific grasping. The figure illustrates the progression of two distinct grasps, each conditioned on a separate task. The left grasp is for the task of cutting something with the knife, and the right one is for handing the knife over to someone.
  • Figure 5: Consolidated ground truth of grasping regions that were determined in the survey. For each task, the participants had to highlight a region where they would grasp the object in order to perform a subsequent task. The figure shows an aggregation of the collected data. The intensity of green indicates the region's popularity for the specific task.
  • ...and 1 more figures
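
The PCA step described in the caption of Figure 2 can be sketched as below: the principal axes of an object's point cloud are computed so that views can be rendered along the directions that explain the most variance. This is a minimal sketch using NumPy on a synthetic point cloud; the function name and the toy data are assumptions, and the rendering, segmentation, and VLM labeling steps are omitted.

```python
import numpy as np


def principal_axes(points):
    """Return (variances, axes) of an (N, 3) point cloud, sorted by
    decreasing variance. Each row of `axes` is a unit direction."""
    centered = points - points.mean(axis=0)
    cov = np.cov(centered, rowvar=False)
    eigvals, eigvecs = np.linalg.eigh(cov)   # eigenvalues in ascending order
    order = np.argsort(eigvals)[::-1]        # reorder to descending
    return eigvals[order], eigvecs[:, order].T


# Synthetic elongated object: most of the spread lies along the x axis.
rng = np.random.default_rng(0)
points = rng.normal(size=(500, 3)) * np.array([0.10, 0.02, 0.01])

variances, axes = principal_axes(points)
print(axes[0])  # direction explaining the most variance
```

For an elongated object such as a knife, the first axis is expected to follow the object's length, so views chosen from these axes tend to show the largest extent of the shape.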