Triple Regression for Camera Agnostic Sim2Real Robot Grasping and Manipulation Tasks

Yuanhong Zeng; Yizhou Zhao; Ying Nian Wu

TL;DR

This paper introduces the Triple Regression Sim2Real framework, which constructs a real-time digital twin of the real scene so that multiple plans can be simulated and evaluated before they are executed in the real world.

Abstract

Sim2Real (Simulation to Reality) techniques have gained prominence in robotic manipulation and motion planning due to their ability to enhance success rates by enabling agents to test and evaluate various policies and trajectories. In this paper, we investigate the advantages of integrating Sim2Real into robotic frameworks. We introduce the Triple Regression Sim2Real framework, which constructs a real-time digital twin. This twin serves as a replica of reality to simulate and evaluate multiple plans before their execution in real-world scenarios. Our triple regression approach addresses the reality gap by: (1) mitigating projection errors between real and simulated camera perspectives through the first two regression models, and (2) detecting discrepancies in robot control using the third regression model. Experiments on 6-DoF grasp and manipulation tasks (where the gripper can approach from any direction) highlight the effectiveness of our framework. Remarkably, with only RGB input images, our method achieves state-of-the-art success rates. This research advances efficient robot training methods and sets the stage for rapid advancements in robotics and automation.
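
The abstract does not state what form the three regression models take. As a purely illustrative sketch, assume that one of the camera-side regressions is an affine least-squares map from image pixel coordinates to simulator coordinates, fit on a handful of calibration point pairs. The function names (`fit_affine`, `apply_affine`), the affine form, and the synthetic calibration data below are all assumptions made for illustration, not the authors' implementation.

```python
import numpy as np


def fit_affine(src, dst):
    """Fit W so that dst ~= [src, 1] @ W in the least-squares sense.

    Hypothetical helper: the affine form is an assumption, not the paper's model.
    """
    src = np.asarray(src, dtype=float)
    dst = np.asarray(dst, dtype=float)
    # Append a column of ones so the map includes a translation term.
    design = np.hstack([src, np.ones((src.shape[0], 1))])
    weights, *_ = np.linalg.lstsq(design, dst, rcond=None)
    return weights


def apply_affine(weights, pts):
    """Apply a fitted affine map to an (N, d) array of points."""
    pts = np.asarray(pts, dtype=float)
    design = np.hstack([pts, np.ones((pts.shape[0], 1))])
    return design @ weights


# Synthetic calibration pairs (made-up numbers): pixel coordinates in a
# 640x640 image and the matching planar coordinates in the simulator.
rng = np.random.default_rng(0)
real_px = rng.uniform(0.0, 640.0, size=(20, 2))
true_scale = np.array([[0.001, 0.0], [0.0, -0.001]])
true_offset = np.array([0.2, 0.5])
sim_xy = real_px @ true_scale + true_offset

W = fit_affine(real_px, sim_xy)
pred = apply_affine(W, real_px)
max_err = float(np.abs(pred - sim_xy).max())
```

Under the same assumption, a control-side correction could reuse `fit_affine` with commanded end-effector positions as `src` and observed positions as `dst`. Whether the paper's third model works this way is not stated in the abstract.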

Paper Structure (10 sections, 8 equations, 5 figures, 3 tables)

Introduction
Method
Joint task planning module
Triple Regression for Matching Simulation and Reality
Validating Sim2Real matching
Experiment
Ablation study on the triple regression framework
Grasp and pour experiment
Comparison of VQA models
Conclusion

Figures (5)

Figure 1: Text-to-action framework: Tasks described in natural language are converted into a joint representation that records spatial and temporal information about the scene and task. A digital twin is created from semantic grounding and camera observations. Motions are planned in the simulator. VQA is used to ascertain whether the performed action succeeded.
Figure 2: Representation of pouring water from a jar to a cup with STC-AOG. The data structure reduces a sentence into subgoals for task planning.
Figure 3: Workflow of the triple regression framework. The framework creates digital twins through 1) camera shots, 2) semantic-based segmentation and contour extraction, 3) key point identification, and 4) object placement in simulation. Plans are then generated and simulated, and the coordinates of successful plans are corrected and executed in reality.
Figure 4: Query fluent matching in robot execution. We select four crucial moments during the robot's execution in both simulation and reality and report GPT-4V's VQA answers to the questionnaire: (1) Is the robot ready to pick up the jar? (2) Has the robot already picked up the jar? (3) Is the jar above the cup? (4) Is the robot pouring water to the cup? The VQA model outputs and the ground truths are shown.
Figure 5: Task execution speed comparison. We compare our method with 6D-CLIPort, RL-based robot motion planning, and human-guided heuristic control, and we measure the average execution time for the whole task.
