SuctionPrompt: Visual-assisted Robotic Picking with a Suction Cup Using Vision-Language Models and Facile Hardware Design

Tomohiro Motoda, Takahide Kitamura, Ryo Hanai, Yukiyasu Domae

TL;DR

SuctionPrompt is a versatile robotic system that combines VLM prompting techniques with 3D detection to perform product-picking tasks in diverse and dynamic environments. It highlights the importance of integrating 3D spatial information with adaptive action planning so that robots can approach and manipulate objects in novel environments.

Abstract

The development of large language models and vision-language models (VLMs) has resulted in the increasing use of robotic systems in various fields. However, the effective integration of these models into real-world robotic tasks remains a key challenge. We developed a versatile robotic system called SuctionPrompt that combines VLM prompting techniques with 3D detection to perform product-picking tasks in diverse and dynamic environments. Our method highlights the importance of integrating 3D spatial information with adaptive action planning to enable robots to approach and manipulate objects in novel environments. In the validation experiments, the system selected suction points accurately 75.4% of the time and achieved a 65.0% success rate in picking common items. This study highlights the effectiveness of VLMs in robotic manipulation tasks, even with simple 3D processing.

Paper Structure

This paper contains 14 sections, 6 equations, 8 figures, 4 tables, 1 algorithm.

Figures (8)

  • Figure 1: Overview of the proposed SuctionPrompt system for robot manipulation tasks. (a) An RGB-depth (RGB-D) image and directive text are input. (b, c) Suction points are generated from estimated 3D surface normal vectors. (d) The robot is instructed by the vision-language model (VLM) to pick up a green tea box. The bottom panel shows various object-picking tasks (potato chip box, green tea box, cola bottle, etc.) with the corresponding suction points for successful grasping.
  • Figure 2: Overview of the robotic system with SuctionPrompt. We propose a versatile robotic manipulation system that combines a suction-cup-based gripper with VLMs to achieve zero-shot object handling, specifically targeting a product-picking task in convenience stores. By integrating depth information from RGB-D cameras, we aim to provide the robot with critical visual prompts, ensuring accurate interaction with various objects.
  • Figure 3: Pipeline for visual prompting. The process begins by capturing depth images to create 3D point clouds of the scene. These point clouds are then divided into clusters using the K-means++ algorithm. Surface normals are calculated for each cluster, providing important 3D pose information. The 3D points and their corresponding normals are then projected onto the 2D RGB image to create visual cues for suction candidates, which are marked with numbered annotations.
  • Figure 4: Robot arm system used for verification on a real robot. (a) Outline of the robot arm system. (b) Suction-gripper-based end-effector.
  • Figure 5: Responses from the vision-language model (GPT-4o) regarding the numbering of suction points on the prompted image, along with the rationale for each selected point.
  • ...and 3 more figures
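
The visual-prompting pipeline described in the Figure 3 caption (depth image, point cloud, K-means++ clustering, per-cluster normals, projection to 2D) can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the pinhole intrinsics `fx, fy, cx, cy`, the use of the cluster centroid as the suction candidate, and the PCA-based normal estimate are all assumptions made for illustration.

```python
import numpy as np

def backproject(depth, fx, fy, cx, cy):
    """Convert a depth image (metres) to an (N, 3) camera-frame point cloud."""
    v, u = np.nonzero(depth > 0)              # rows/cols with valid depth
    z = depth[v, u]
    return np.column_stack([(u - cx) * z / fx, (v - cy) * z / fy, z])

def kmeans_pp(points, k, iters=20, seed=0):
    """Lloyd's k-means with k-means++ seeding; returns (centers, labels)."""
    rng = np.random.default_rng(seed)
    centers = [points[rng.integers(len(points))]]
    for _ in range(k - 1):
        # Squared distance from each point to its nearest chosen center.
        d2 = np.min([np.sum((points - c) ** 2, axis=1) for c in centers], axis=0)
        centers.append(points[rng.choice(len(points), p=d2 / d2.sum())])
    centers = np.array(centers, dtype=float)

    def assign(c):
        return np.linalg.norm(points[:, None, :] - c[None, :, :], axis=2).argmin(axis=1)

    for _ in range(iters):
        labels = assign(centers)
        for j in range(k):
            if np.any(labels == j):
                centers[j] = points[labels == j].mean(axis=0)
    return centers, assign(centers)

def cluster_normal(members):
    """Estimate a normal by PCA (assumed method): direction of least variance."""
    centroid = members.mean(axis=0)
    _, _, vt = np.linalg.svd(members - centroid, full_matrices=False)
    normal = vt[-1]
    # Flip so the normal points back toward the camera at the origin.
    return -normal if np.dot(normal, centroid) > 0 else normal

def project(point, fx, fy, cx, cy):
    """Pinhole projection of a 3D camera-frame point to pixel coordinates."""
    x, y, z = point
    return (fx * x / z + cx, fy * y / z + cy)

def suction_candidates(depth, fx, fy, cx, cy, k):
    """Return numbered candidates, each with a pixel, a 3D point, and a normal."""
    points = backproject(depth, fx, fy, cx, cy)
    centers, labels = kmeans_pp(points, k)
    candidates = []
    for j in range(k):
        members = points[labels == j]
        if len(members) < 3:                  # too few points to fit a plane
            continue
        candidates.append({
            "id": len(candidates) + 1,        # number drawn on the RGB image
            "pixel": project(centers[j], fx, fy, cx, cy),
            "point": centers[j],
            "normal": cluster_normal(members),
        })
    return candidates
```

In such a sketch, each candidate's `id` would be drawn at its `pixel` location on the RGB image, and the annotated image would then be passed to the VLM together with the directive text so that it can answer with the number of the preferred suction point.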