OptiGrasp: Optimized Grasp Pose Detection Using RGB Images for Warehouse Picking Robots

Soofiyan Atar; Yi Li; Markus Grotz; Michael Wolf; Dieter Fox; Joshua Smith

TL;DR

This work uses foundation models to improve suction grasping from RGB images alone and reports an 82.3% success rate in real-world applications.

Abstract

In warehouse environments, robots require robust picking capabilities to manage a wide variety of objects. Effective deployment demands minimal hardware, strong generalization to new products, and resilience in diverse settings. Current methods often rely on depth sensors for structural information, which suffer from high costs, complex setups, and technical limitations. Inspired by recent advancements in computer vision, we propose an innovative approach that leverages foundation models to enhance suction grasping using only RGB images. Trained solely on a synthetic dataset, our method generalizes its grasp prediction capabilities to real-world robots and a diverse range of novel objects not included in the training set. Our network achieves an 82.3% success rate in real-world applications. The project website with code and data will be available at http://optigrasp.github.io.

Paper Structure (16 sections, 2 equations, 6 figures, 5 tables)

INTRODUCTION
RELATED WORK
Analytic Models
Learning Suction Grasps
Problem definition
METHOD
Network Structure
Simulation Environment and Data Generation
Data Labelling
Affordance Grasp Score
Training
EXPERIMENTS
Real Robot Setup
Results
Failure Cases and Future Work
...and 1 more sections

Figures (6)

Figure 1: Our robot is picking from a cluttered industrial shelving unit.
Figure 2: The system architecture. The network takes an RGB image and the mask of the target object as inputs and predicts three dense maps, each the same size as the input image: the affordance grasp score, the pitch angle, and the yaw angle at each pixel, as described in the Affordance Grasp Score section. Higher values are visualized in redder colors. To determine the grasp pose for the suction gripper, the pixel with the highest value in the grasp score map is selected, and the values at the same pixel in the pitch and yaw maps are used to compute the final grasp pose. The DINOv2 (Oquab et al., 2023) backbone from Depth Anything (Yang et al., 2024) keeps its weights frozen, while the Dense Prediction Transformer (DPT) (Ranftl et al., 2021) is refined during training.
Figure 3: Illustration of the synthetic data we generated. The first row shows RGB images, while the second row shows per-pixel affordance scores computed with the affordance grasp score equation.
Figure 4: Robotic work cell.
Figure 5: Our three object sets, ranging from easy through medium to hard (left to right).
...and 1 more figures
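
The grasp selection step described in the Figure 2 caption (take the highest-scoring pixel, then read pitch and yaw at that pixel) can be sketched roughly as follows. This is a minimal illustration in NumPy, not the authors' code: the function name `select_grasp`, the array layout (H x W maps), and the optional restriction of the search to the object mask are assumptions made for this example.

```python
import numpy as np

def select_grasp(score_map, pitch_map, yaw_map, mask=None):
    """Return ((row, col), pitch, yaw) at the highest-scoring pixel.

    Hypothetical helper: all three maps are assumed to be H x W arrays
    aligned with the input image.
    """
    scores = np.asarray(score_map, dtype=float)
    if mask is not None:
        # Assumption: pixels outside the target-object mask are excluded.
        scores = np.where(np.asarray(mask, dtype=bool), scores, -np.inf)
    # Flat argmax converted back to 2D pixel coordinates.
    row, col = np.unravel_index(np.argmax(scores), scores.shape)
    pitch = float(np.asarray(pitch_map)[row, col])
    yaw = float(np.asarray(yaw_map)[row, col])
    return (int(row), int(col)), pitch, yaw
```

Turning the selected pixel and its two angles into a full 6-DoF suction pose additionally needs camera calibration and a 3D position for the pixel, which this sketch leaves out.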
