VIHE: Virtual In-Hand Eye Transformer for 3D Robotic Manipulation

Weiyao Wang; Yutian Lei; Shiyu Jin; Gregory D. Hager; Liangjun Zhang

Abstract

In this work, we introduce the Virtual In-Hand Eye Transformer (VIHE), a novel method designed to enhance 3D manipulation capabilities through action-aware view rendering. VIHE autoregressively refines actions over multiple stages by conditioning on views rendered from virtual cameras posed according to the action predictions of earlier stages. These virtual in-hand views provide a strong inductive bias for recognizing the correct hand pose, especially in challenging high-precision tasks such as peg insertion. On 18 manipulation tasks in RLBench simulated environments with 100 demonstrations per task, VIHE achieves a new state of the art, a 12% absolute improvement (from 65% to 77%) over the previous state-of-the-art model. In real-world scenarios, VIHE can learn manipulation tasks from just a handful of demonstrations, highlighting its practical utility. Videos and code can be found at our project site: https://vihe-3d.github.io.

Paper Structure (17 sections, 5 equations, 5 figures, 3 tables)

Introduction
Related Work
Vision-Based Imitation Learning for Robotic Manipulation
In-Hand View for Robotic Manipulation
Methods
Overview
Iterative View Rendering and Action Refinement
Initial Global Stage
Iterative Refinement
VIHE Architecture
Network Architecture
Action Prediction
Training
Experiments
Simulation Experiments
...and 2 more sections

Figures (5)

Figure 1: Visual example of VIHE in the real world. Our method iteratively refines its 3D action prediction (right) by rendering 2D in-hand views based on the previous stage's predictions (left). Gray, green, and blue denote the three action prediction stages, respectively.
Figure 2: VIHE scales and performs better than RVT, PerAct, and Act3D, among other baselines. Owing to the inductive bias from in-hand views, VIHE also requires 5× less time to reach performance on par with the previous SOTA method.
Figure 3: VIHE Overview. Starting with RGB-D images from multi-view cameras, we first construct a point cloud of the scene. Global views are then rendered using fixed cameras positioned around the workspace. From these global views, the network outputs initial action predictions $a_{pose}^0, a_{open}^0, a_{col}^0$. At each refinement stage $i$, we autoregressively generate virtual in-hand views from cameras attached to the previously predicted gripper pose $a_{pose}^{i-1}$, and refine the action predictions based on these rendered views. The network architecture employs masked self-attention so that tokens from later stages attend to tokens from previous stages. Language instruction tokens are merged with the stage-0 image tokens at the transformer input; this is omitted from the figure for conciseness. More information can be found in the Methods section.
Figure 4: Visualization of sample images from the front camera in RGB view, global view, and virtual in-hand view (the latter two rendered as orthographic images). For the global orthographic view and the virtual in-hand view, two (top and front) of the five views (top, front, back, left, and right) are shown. The virtual in-hand view reveals information that is occluded in the global view, and in greater detail, leading to better manipulation performance.
Figure 5: Real-world object manipulation tasks. A single VIHE model can perform multiple tasks (6 tasks, 18 variations) in the real world with just 72 demonstrations in total.
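
The stage-wise masked self-attention mentioned in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: the function name `stage_attention_mask`, the token counts, and the choice to let tokens attend within their own stage (a block-causal pattern) are assumptions made for illustration only.

```python
import numpy as np

def stage_attention_mask(tokens_per_stage):
    """Build a boolean mask M where M[q, k] is True if query token q may attend to key token k.

    Assumption (hypothetical, not confirmed by the source): a token may attend to
    tokens of its own stage and of all earlier stages, but never to later stages.
    """
    # Label every token with the index of the stage it belongs to.
    stage_ids = np.repeat(np.arange(len(tokens_per_stage)), tokens_per_stage)
    # A query in stage i can see a key in stage j only when j <= i.
    return stage_ids[:, None] >= stage_ids[None, :]

# Hypothetical token counts: 3 tokens in stage 0, 2 in stage 1, 2 in stage 2.
mask = stage_attention_mask([3, 2, 2])
print(mask.astype(int))
```

With a mask of this shape, the stage-0 (global) tokens are unaffected by later in-hand views, while each refinement stage can condition on everything rendered before it.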