PhyGrasp: Generalizing Robotic Grasping with Physics-informed Large Multimodal Models

Dingkun Guo; Yuqi Xiang; Shuqi Zhao; Xinghao Zhu; Masayoshi Tomizuka; Mingyu Ding; Wei Zhan

TL;DR

PhyGrasp addresses the challenge of generalizing robotic grasping to counter-intuitive and long-tailed objects by grounding physical commonsense in a physics-informed, multimodal framework. It combines a frozen 3D vision encoder (PointNext) and a language model (Llama 2) through a bridge network to produce per-point grasp affordances and embedding-based pairings, trained with three losses balanced by AWL. The authors introduce PhyPartNet, a dataset of approximately 195K object instances with part-level physical properties and language annotations, plus analytical grasping solutions to generate ground-truth supervision and language summaries via GPT-3.5. In both simulation and real-world experiments, PhyGrasp achieves state-of-the-art performance, particularly on long-tailed and hard instances, with about a 10% improvement over GraspNet, demonstrating the practical impact of physics-informed, language-grounded grasping for safer and more adaptable robotic manipulation.

Abstract

Robotic grasping is a fundamental aspect of robot functionality, defining how robots interact with objects. Despite substantial progress, its generalizability to counter-intuitive or long-tailed scenarios, such as objects with uncommon materials or shapes, remains a challenge. In contrast, humans can easily apply their intuitive physics to grasp skillfully and change grasps efficiently, even for objects they have never seen before. This work explores infusing such physical commonsense reasoning into robotic manipulation. We introduce PhyGrasp, a multimodal large model that takes inputs from two modalities, natural language and 3D point clouds, integrated through a bridge module. The language modality provides robust reasoning about how diverse physical properties affect grasping, while the 3D modality captures object shapes and parts. With these two capabilities, PhyGrasp can accurately assess the physical properties of object parts and determine optimal grasping poses. Additionally, the model's language comprehension lets it interpret human instructions and generate grasping poses that align with human preferences. To train PhyGrasp, we construct a dataset, PhyPartNet, of 195K object instances with varying physical properties and human preferences, alongside their corresponding language descriptions. Extensive experiments in simulation and on real robots demonstrate that PhyGrasp achieves state-of-the-art performance, particularly in long-tailed cases, e.g., about 10% improvement in success rate over GraspNet. Project page: https://sites.google.com/view/phygrasp

Paper Structure (27 sections, 4 equations, 7 figures, 4 tables)

Introduction
Related Work
Physical Reasoning
Large Multimodal Models
Large Models for Robot Learning
Grasp Pose Detection
Dataset Generation
Dataset Statistics
Analytical Grasping Solutions
Language Summary Generation
Learning Methods
Feature Extraction
Vision Encoder
Language Encoder
Bridge Network
...and 12 more sections

Figures (7)

Figure 1: Motivation of PhyGrasp. Current robot grasping policies (left) typically predict grasping poses based solely on the object's 3D shape, neglecting its physical properties. This oversight can lead to potential damage to the display. In contrast, integrating physical common sense into robotic systems (right) can address this issue effectively.
Figure 2: Dataset Statistics. The left and right figures denote instance distributions among objects and materials, respectively.
Figure 3: An overview of our PhyPartNet generation pipeline and our PhyGrasp framework. Given object meshes sampled from PartNet, we leverage GPT-3.5 and an analytical method to automatically generate the grasping affordance map and language descriptions for the object instance. The generated data is then human-verified, forming our PhyPartNet. We freeze PointNext (Qian et al., 2022) and Llama 2 (Touvron et al., 2023) and tune the bridge network during training on PhyPartNet. After training, PhyGrasp is able to generalize to novel 3D point clouds and new natural language instructions.
Figure 4: The architecture for the bridge module of PhyGrasp. It outputs the grasping probability (affordance map) and the pair embedding for each point.
Figure 5: Visualizations of the affordance map and grasping pair match map for our method. The left column is the affordance map of the analytical method (ground truth), the middle is our affordance map, and the right is the grasping pair match map. We observe that our affordance map prediction exhibits high quality and closely resembles the ground truth. In the match map, yellow intensity indicates the matching confidence, with red and yellow points representing an anchor and its top-1 matching pair.
...and 2 more figures
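
The captions for Figures 4 and 5 describe two per-point outputs, an affordance (grasping probability) and a pair embedding, and a match map that highlights an anchor point and its top-1 matching pair. The sketch below illustrates one way such outputs could be turned into a grasp pair. It is a minimal illustration under stated assumptions, not the authors' implementation: picking the anchor as the highest-affordance point, scoring matches by cosine similarity, and the names `cosine` and `top1_pair` are all hypothetical choices that this page does not confirm.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is zero)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def top1_pair(affordance, embeddings):
    """Return (anchor index, partner index, match score).

    Assumption: the anchor is the point with the highest affordance, and its
    partner is the other point whose embedding is most similar to the anchor's.
    """
    anchor = max(range(len(affordance)), key=lambda i: affordance[i])
    scores = [
        cosine(embeddings[anchor], emb) if j != anchor else float("-inf")
        for j, emb in enumerate(embeddings)
    ]
    partner = max(range(len(scores)), key=lambda j: scores[j])
    return anchor, partner, scores[partner]


# Toy data: four points with made-up affordances and 2-D pair embeddings.
affordance = [0.1, 0.9, 0.3, 0.7]
embeddings = [[1.0, 0.0], [0.0, 1.0], [0.7, 0.7], [0.1, 0.9]]

anchor, partner, score = top1_pair(affordance, embeddings)
print(f"anchor={anchor}, partner={partner}, score={score:.3f}")
```

In this toy example the score plays the role of the matching confidence shown in the match map; a real system would operate on thousands of points and would likely use a batched tensor library rather than Python lists.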
