RoboUniView: Visual-Language Model with Unified View Representation for Robotic Manipulation

Fanfan Liu, Feng Yan, Liming Zheng, Chengjian Feng, Yiyang Huang, Lin Ma

TL;DR

RoboUniView tackles poor cross-camera generalization in visual-language robotic manipulation by decoupling visual feature extraction from action learning. It introduces UVFormer to convert multi-view observations into a unified 3D space and uses a fusion-decoder architecture to produce actions from this representation, with pre-training on a 3D occupancy task. The approach achieves state-of-the-art results on the CALVIN benchmark, showing strong zero-shot and cross-dataset generalization to unseen camera parameters and tasks. These results suggest a robust, adaptable framework for embodied AI that can transfer across robots with different perceptual configurations.

Abstract

Utilizing Vision-Language Models (VLMs) for robotic manipulation represents a novel paradigm, aiming to enhance the model's ability to generalize to new objects and instructions. However, due to variations in camera specifications and mounting positions, existing methods exhibit significant performance disparities across different robotic platforms. To address this challenge, we propose RoboUniView in this paper, an innovative approach that decouples visual feature extraction from action learning. We first learn a unified view representation from multi-perspective views by pre-training on readily accessible data, and then derive actions from this unified view representation to control robotic manipulation. This unified view representation more accurately mirrors the physical world and is not constrained by the robotic platform's camera parameters. Thanks to this methodology, we achieve state-of-the-art performance on the demanding CALVIN benchmark, raising the success rate in the $D \to D$ setting from 93.0% to 96.2%, and in the $ABC \to D$ setting from 92.2% to 94.2%. Moreover, our model exhibits outstanding adaptability and flexibility: it maintains high performance under unseen camera parameters, can utilize multiple datasets with varying camera parameters, and is capable of joint cross-task learning across datasets. Code is available at https://github.com/liufanfanlff/RoboUniview.
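
The decoupling described in the abstract implies a two-stage recipe: pre-train the unified-view encoder on a 3D occupancy objective (which needs no action labels), then fine-tune it with an action head on robot demonstrations. The PyTorch sketch below illustrates only that training flow; the TinyUVFormer stand-in, tensor shapes, losses, learning rates, and random data are all assumptions for illustration, not the paper's implementation.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TinyUVFormer(nn.Module):
    """Toy stand-in: fuses per-view features + camera params into a unified grid."""
    def __init__(self, feat_dim=64, grid_cells=256):
        super().__init__()
        self.proj = nn.Linear(feat_dim + 16, 64)  # 16 = flattened 4x4 extrinsics
        self.grid_cells = grid_cells

    def forward(self, view_feats, extrinsics):
        # view_feats: (B, V, feat_dim); extrinsics: (B, V, 4, 4)
        x = torch.cat([view_feats, extrinsics.flatten(2)], dim=-1)
        fused = self.proj(x).mean(dim=1)  # average over views -> (B, 64)
        return fused.unsqueeze(1).expand(-1, self.grid_cells, -1)  # (B, cells, 64)

uvformer = TinyUVFormer()
occ_head = nn.Linear(64, 1)  # per-cell occupancy logit
act_head = nn.Linear(64, 7)  # e.g. 6-DoF end-effector delta + gripper state

view_feats = torch.randn(2, 3, 64)           # dummy batch: 2 samples, 3 views
extrinsics = torch.eye(4).repeat(2, 3, 1, 1)

# Stage 1: pre-train encoder + occupancy head (no action labels needed).
opt = torch.optim.AdamW([*uvformer.parameters(), *occ_head.parameters()], lr=1e-4)
occ_gt = torch.randint(0, 2, (2, 256, 1)).float()
loss = F.binary_cross_entropy_with_logits(occ_head(uvformer(view_feats, extrinsics)), occ_gt)
opt.zero_grad(); loss.backward(); opt.step()

# Stage 2: fine-tune the same encoder with an action head on robot data.
opt = torch.optim.AdamW([*uvformer.parameters(), *act_head.parameters()], lr=1e-5)
act_gt = torch.randn(2, 7)
unified = uvformer(view_feats, extrinsics).mean(dim=1)  # pool the grid for the decoder
loss = F.mse_loss(act_head(unified), act_gt)
opt.zero_grad(); loss.backward(); opt.step()
```

Because the occupancy stage needs only calibrated multi-view imagery with recoverable geometry, it can draw on data far cheaper to collect than teleoperated robot demonstrations, which is the point of decoupling visual feature extraction from action learning.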

Paper Structure

This paper contains 17 sections, 6 equations, 4 figures, and 4 tables.

Figures (4)

  • Figure 1: Visualization of RoboUniView on the $D \to D$ split. The first row shows the predicted occupancy, and the second row shows the predicted rollouts.
  • Figure 2: Overview of the RoboUniView. RoboUniView is first pre-trained on the 3D occupancy task, and then fine-tuned on robot data to learn multi-task visual robot manipulation.
  • Figure 3: UVFormer, which contains grid-shaped UniView queries, Spatial Cross-Attention, and Self-Attention. Within the Spatial Cross-Attention, each UniView query interacts only with the image features at the pixel coordinates projected from its corresponding $P$ 3D points (see the sketch after this list).
  • Figure 4: Visualization of environmental configurations in Advanced Experiments.
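
The Spatial Cross-Attention in Figure 3 restricts each grid query to the image features found at the pixel locations where its $P$ reference points land in each camera view. The sketch below shows only that geometric core, assuming pinhole intrinsics and world-to-camera extrinsics, with plain bilinear `grid_sample` standing in for the paper's attention; the shapes and toy camera are illustrative.

```python
import torch
import torch.nn.functional as F

def project_points(points_3d, intrinsics, extrinsics):
    """Project world-frame 3D points into one camera's pixel coordinates.

    points_3d:  (N, 3) world coordinates
    intrinsics: (3, 3) pinhole camera matrix K
    extrinsics: (4, 4) world-to-camera transform
    Returns (N, 2) pixel coordinates and an (N,) in-front-of-camera mask.
    """
    homo = torch.cat([points_3d, torch.ones_like(points_3d[:, :1])], dim=-1)
    cam = (extrinsics @ homo.T).T[:, :3]      # world -> camera frame
    valid = cam[:, 2] > 1e-5                  # keep points in front of the camera
    pix = (intrinsics @ cam.T).T
    return pix[:, :2] / pix[:, 2:3].clamp(min=1e-5), valid  # perspective divide

def sample_view_features(feat_map, pix, h, w):
    """Bilinearly sample a (C, H, W) feature map at (N, 2) pixel coords in a w x h frame."""
    grid = torch.stack([pix[:, 0] / (w - 1) * 2 - 1,   # normalize to [-1, 1]
                        pix[:, 1] / (h - 1) * 2 - 1], dim=-1).view(1, 1, -1, 2)
    out = F.grid_sample(feat_map.unsqueeze(0), grid, align_corners=True)
    return out.view(feat_map.shape[0], -1).T           # (N, C)

# Toy usage: one 80x60 camera at the origin, two reference points of a query pillar.
K = torch.tensor([[100., 0., 40.], [0., 100., 30.], [0., 0., 1.]])
E = torch.eye(4)
pts = torch.tensor([[0.1, 0.0, 2.0], [0.0, 0.2, 3.0]])
pix, ok = project_points(pts, K, E)
feats = sample_view_features(torch.randn(64, 60, 80), pix[ok], 60, 80)  # (num_valid, 64)
```

In the full model, each UniView query would aggregate the features sampled at all $P$ of its points across every view via attention, masking out views where the projection falls behind the camera or outside the image; averaging the samples, as one might do with the output above, is only the crudest stand-in.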