
Robo-MUTUAL: Robotic Multimodal Task Specification via Unimodal Learning

Jianxiong Li, Zhihao Wang, Jinliang Zheng, Xiaoai Zhou, Guanming Wang, Guanglu Song, Yu Liu, Jingjing Liu, Ya-Qin Zhang, Junzhi Yu, Xianyuan Zhan

TL;DR

Evaluation across more than 130 tasks and 4000 evaluations, on both the simulated LIBERO benchmark and real robot platforms, showcases the superior capabilities of the proposed framework, demonstrating significant potential in overcoming data constraints in robotic learning.

Abstract

Multimodal task specification is essential for enhanced robotic performance, where Cross-modality Alignment enables the robot to holistically understand complex task instructions. Directly annotating multimodal instructions for model training is impractical due to the sparsity of paired multimodal data. In this study, we demonstrate that by leveraging unimodal instructions, which are abundant in real data, we can effectively teach robots to learn multimodal task specifications. First, we endow the robot with strong Cross-modality Alignment capabilities by pretraining a robotic multimodal encoder on extensive out-of-domain data. Then, we employ two operations, Collapse and Corrupt, to further bridge the remaining modality gap in the learned multimodal representation. This approach projects different modalities of the same task goal to interchangeable representations, thus enabling accurate robotic operations within a well-aligned multimodal latent space. Evaluation across more than 130 tasks and 4000 evaluations, on both the simulated LIBERO benchmark and real robot platforms, showcases the superior capabilities of our proposed framework, demonstrating a significant advantage in overcoming data constraints in robotic learning. Website: zh1hao.wang/Robo_MUTUAL

Paper Structure (16 sections, 5 equations, 7 figures)

This paper contains 16 sections, 5 equations, 7 figures.

Figures (7)

  • Figure 1: Train robot policies on unimodal task prompts, but evaluate them using prompts across multiple modalities.
  • Figure 2: Robo-MUTUAL training pipeline. I. Pretrain a robotic multimodal encoder on broader out-of-domain human and robotics data. II. Utilize the powerful pretrained Cross-modality Alignment capability and further bridge the modality gap in an efficient and training-free manner. III. Achieve multimodal task specification via unimodal task learning, leveraging the well-aligned multimodal representations.
  • Figure 3: Heatmaps of cosine similarity between representations of language and visual goals. The diagonals are matched pairs. DecisionNCE (Robo-MUTUAL) enjoys strong Cross-modality Alignment capability after absorbing broader out-of-domain data.
  • Figure 4: Absolute difference in the means of each embedding dimension across different modalities. The modality gap manifests in a few dimensions with large discrepancies across modalities, while others remain consistent.
  • Figure 6: Simulation and real world robotics evaluation setups.
  • ...and 2 more figures
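
The two diagnostics named in the captions of Figures 3 and 4 (pairwise cosine similarity between language and visual goal representations, and the absolute difference of per-dimension means across modalities) can be sketched as follows. This is only an illustrative sketch, not the authors' code: the function names and the toy 3-dimensional embeddings are made up for the example, and a real analysis would use embeddings produced by the pretrained multimodal encoder.

```python
import math


def cosine_similarity_matrix(lang, vis):
    """Return S with S[i][j] = cosine similarity of lang[i] and vis[j]."""
    def norm(v):
        return math.sqrt(sum(x * x for x in v))

    return [
        [sum(a * b for a, b in zip(u, v)) / (norm(u) * norm(v)) for v in vis]
        for u in lang
    ]


def per_dimension_mean_gap(lang, vis):
    """Absolute difference between the per-dimension means of two modalities."""
    dim = len(lang[0])
    mean_lang = [sum(e[d] for e in lang) / len(lang) for d in range(dim)]
    mean_vis = [sum(e[d] for e in vis) / len(vis) for d in range(dim)]
    return [abs(a - b) for a, b in zip(mean_lang, mean_vis)]


# Hypothetical toy embeddings: row i of each list describes the same task goal.
# The last dimension is deliberately offset between modalities to mimic a gap.
lang_emb = [[1.0, 0.0, 0.5], [0.0, 1.0, 0.5]]
vis_emb = [[1.0, 0.0, -0.5], [0.0, 1.0, -0.5]]

sim = cosine_similarity_matrix(lang_emb, vis_emb)  # heatmap values (cf. Figure 3)
gap = per_dimension_mean_gap(lang_emb, vis_emb)    # per-dimension gap (cf. Figure 4)

for row in sim:
    print([round(s, 3) for s in row])
print(gap)
```

In this toy setup the matched pairs on the diagonal of `sim` score higher than the mismatched pairs, and `gap` is nonzero only in the last dimension, which mirrors the pattern the Figure 4 caption describes: a modality gap concentrated in a few dimensions.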