Polaris: Open-ended Interactive Robotic Manipulation via Syn2Real Visual Grounding and Large Language Models

Tianyu Wang; Haitao Lin; Junqiu Yu; Yanwei Fu

TL;DR

Polaris addresses open-ended tabletop manipulation by combining GPT-4-based scene perception and interaction with grounded vision models and a synthetic-to-real (Syn2Real) 6D pose estimation pipeline trained entirely on synthetic data. The Syn2Real category-level pose estimator extends SAR-Net to 24 categories, providing real-world pose estimates to a 6D pose-based manipulation planner within an interactive LLM-guided framework. Real-robot experiments validate high pose estimation accuracy and effective task execution, including complex compositional tasks, while ablations show the value of components such as Grounded-Light-HQSAM and the GPT-4 prompts. The results suggest that the approach could generalize to broader object categories and task domains, and point to practical potential for flexible, open-ended human-robot collaboration in real tabletop settings.

Abstract

This paper investigates the task of open-ended interactive robotic manipulation in tabletop scenarios. While recent Large Language Models (LLMs) enhance robots' comprehension of user instructions, their lack of visual grounding constrains their ability to physically interact with the environment, because the robot needs to locate the target object for manipulation within the physical workspace. To this end, we introduce an interactive robotic manipulation framework called Polaris, which integrates perception and interaction by utilizing GPT-4 alongside grounded vision models. For precise manipulation, it is essential that such grounded vision models produce a detailed pose for the target object, rather than merely identifying the pixels that belong to it in the image. Consequently, we propose a novel Synthetic-to-Real (Syn2Real) pose estimation pipeline. This pipeline is trained on rendered synthetic data and then transferred to real-world manipulation tasks. Its real-world performance demonstrates the efficacy of the proposed pipeline and underscores its potential for extension to more general categories. Moreover, real-robot experiments showcase the strong performance of our framework in grasping and in executing multiple manipulation tasks, which indicates its potential to generalize to scenarios beyond the tabletop. More information and video results are available here: https://star-uu-wang.github.io/Polaris/

Paper Structure (11 sections, 5 figures, 2 tables, 1 algorithm)

Introduction
Related Work
Problem Formulation
Method
3D Synthetic Data Rendering
Syn2Real Category-level Pose Estimation
Open-ended Interactive Robotic Manipulation
Experiment
Real-world Object Pose Estimation Evaluation
Open-ended Interactive Real-Robot Experiments
Conclusion

Figures (5)

Figure 1: Polaris: A tabletop robotic manipulation framework centered on Syn2Real visual grounding and driven by open-ended interaction with GPT-4. Users engage in continuous, open-ended interaction with the LLM, which maintains an ongoing understanding of the scene. 3D synthetic data is used to train the grounded vision modules, enabling the execution of real-world tabletop robotic tasks.
Figure 2: Overview of our framework. (a) 3D synthetic data rendering. During rendering, we automatically generate varied synthetic data by loading 3D model assets into a simulation engine and deploying a dynamic virtual camera. We use Fibonacci sphere sampling to select rendering viewpoints and generate the corresponding RGB images, depth maps, poses, and observable point clouds. (b) The vision-centric robotic task pipeline. Given an image of the scene, GPT-4, prompted as a scene perception and interaction LLM, interprets the user's instructions and describes the objects and tasks. Our parser interprets these descriptions. We freeze the pre-trained detector and segmentation model within the grounded vision models and use a synthetic dataset to train the category-level pose estimation model. After retrieving object attributes, the model predicts poses from the scene, allowing a 6D pose robot manipulation planner to execute real-world tasks.
Figure 3: Real-world experimental objects. We test our method on different instances of multiple tabletop-level objects, some of which are easily confused in terms of color, shape, rigidity, deformability, and functionality.
Figure 4: Results of real-world object pose estimation. (a) Test results on single-object scenes. We present a subset of the visualized pose and size estimates produced by the trained MVPoseNet6D model. Each result is drawn as a tight oriented 3D bounding box with colored XYZ axes. (b) The same object under multiple views. We show the estimated pose of a bottle from different viewpoints. (c) Multiple objects under the same view. We show pose estimates for different objects in several cluttered scenes.
Figure 5: Examples of open-ended interactive real-robot experiments. Manipulation tasks for three different base scenes are presented, including excerpts from the interaction between the user and the LLM, the pose estimation results for the manipulated objects in each scene, and keyframes of the robot manipulation. Scene A: Stack bottles on the table. Scene B: Tidy the items on the workbench. Scene C: A compositional task that considers the affordances of objects after a sudden collision.
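
The caption of Figure 2(a) mentions Fibonacci sphere sampling for choosing rendering viewpoints. The following is a minimal sketch of that general technique, not the authors' implementation: the function names, the number of views (64), the sphere radius (1.5), and the choice to sample the full sphere with the object at the origin are all assumptions made for illustration, since the page does not state them.

```python
import math


def fibonacci_sphere(n, radius=1.0):
    """Return n roughly evenly spaced points on a sphere (hypothetical helper).

    Uses the golden-angle spiral: z decreases in equal steps while the
    azimuth advances by the golden angle at every step.
    """
    if n < 1:
        raise ValueError("n must be at least 1")
    golden_angle = math.pi * (3.0 - math.sqrt(5.0))
    points = []
    for i in range(n):
        # Half-step offset keeps samples away from the exact poles.
        z = 1.0 - (2.0 * i + 1.0) / n
        ring_radius = math.sqrt(max(0.0, 1.0 - z * z))
        theta = golden_angle * i
        points.append((
            radius * ring_radius * math.cos(theta),
            radius * ring_radius * math.sin(theta),
            radius * z,
        ))
    return points


def look_at_direction(camera_position):
    """Unit vector pointing from the camera toward the origin.

    Assumes the object is centered at the origin (an illustrative choice).
    """
    x, y, z = camera_position
    norm = math.sqrt(x * x + y * y + z * z)
    return (-x / norm, -y / norm, -z / norm)


# Assumed values: 64 views on a sphere of radius 1.5 around the object.
viewpoints = fibonacci_sphere(64, radius=1.5)
view_directions = [look_at_direction(p) for p in viewpoints]
```

Each sampled position, together with its viewing direction, could then be handed to a renderer to produce the RGB, depth, pose, and point-cloud outputs the caption lists. For tabletop scenes, one might keep only the points with positive z so that the camera stays above the table; that filter is likewise an assumption rather than something stated here.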
