RoboBERT: An End-to-end Multimodal Robotic Manipulation Model

Sicheng Wang, Sheng Liu, Weiheng Wang, Jianhua Shan, Bin Fang

TL;DR

RoboBERT tackles data efficiency in multimodal robotic manipulation by decoupling policy learning from linguistic generalization via a two-stage training paradigm. It uses a diffusion-based policy head, a BERT-based language connector, and a CLIP-ViT visual backbone to fuse language and vision into end-to-end control, trained with Behavioral Cloning on language-labeled demonstrations. The approach achieves state-of-the-art mean episode lengths on CALVIN ABCD-D and ABC-D benchmarks (4.52 and 3.79) without relying on large robotics datasets, and shows superior real-robot success rates on a 6-DOF manipulator compared to comparable methods. These results suggest that data-augmentation-enhanced two-stage training is an efficient, scalable path for robust multimodal robotic manipulation.

Abstract

Embodied intelligence seamlessly integrates vision, language, and action. However, most multimodal robotic models rely on massive fine-tuning, incurring high time and hardware costs. To address this, we introduce RoboBERT, an end-to-end multimodal manipulation model built around a novel two-stage training paradigm. In the first stage, we freeze most of the vision encoder and train with a single "standard" instruction phrasing, allowing the model to focus on stable policy learning via a CNN-based diffusion policy. In the second stage, we unfreeze all modules and inject diverse natural language variants, rapidly aligning varied instructions to the already-learned policy without destabilizing performance. We further employ systematic data augmentations to enhance robustness against visual perturbations. Without relying on auxiliary datasets, RoboBERT achieves new state-of-the-art (SOTA) mean episode lengths of 4.52 on the CALVIN ABCD-D benchmark and 3.79 on the ABC-D benchmark using only language-labeled expert demonstrations and a comparatively lightweight architecture. Real-robot trials on a 6-DOF manipulator confirm higher success rates than comparable methods trained on identical data. These results demonstrate that our data-augmentation-enhanced two-stage training paradigm delivers efficient, scalable, and broadly applicable performance for multimodal robotic systems.
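
The two-stage schedule described in the abstract can be sketched as a small configuration helper. This is a minimal illustration, not the authors' code: the module group names (`vit_early_layers`, `language_connector`, and so on), the `StageConfig` type, and the helper functions are all hypothetical. The abstract only states that most of the vision encoder is frozen in stage one and that all modules are unfrozen in stage two; which other modules (if any) are frozen in stage one is not stated there, so the sketch assumes only the early ViT layers are frozen.

```python
import random
from dataclasses import dataclass

# Hypothetical module groups; the names are illustrative, not from the paper's code.
MODULES = (
    "vit_early_layers",
    "vit_last_layer",
    "language_connector",
    "fusion_layer",
    "diffusion_head",
)


@dataclass(frozen=True)
class StageConfig:
    trainable: frozenset          # module groups that receive gradient updates
    use_language_variants: bool   # False: one standard phrasing per task


def stage_config(stage: int) -> StageConfig:
    """Return the (assumed) freezing and instruction policy for a training stage."""
    if stage == 1:
        # Assumption: only the early ViT layers are frozen in stage one.
        frozen = {"vit_early_layers"}
        return StageConfig(frozenset(MODULES) - frozen, use_language_variants=False)
    if stage == 2:
        # Stage two: every module is trainable and diverse phrasings are used.
        return StageConfig(frozenset(MODULES), use_language_variants=True)
    raise ValueError(f"unknown stage: {stage}")


def pick_instruction(task, stage, standard, variants, rng):
    """Choose the instruction text paired with a demonstration of `task`."""
    if stage_config(stage).use_language_variants:
        return rng.choice(variants[task])
    return standard[task]


if __name__ == "__main__":
    # Made-up example phrasings, purely for illustration.
    standard = {"open_drawer": "open the drawer"}
    variants = {"open_drawer": ["pull the drawer open", "slide the drawer out"]}
    rng = random.Random(0)
    print(pick_instruction("open_drawer", 1, standard, variants, rng))
    print(pick_instruction("open_drawer", 2, standard, variants, rng))
```

In a real training loop, the `trainable` set would typically be used to toggle gradient updates per parameter group before each stage; that wiring depends on the framework and is left out here.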

Paper Structure

This paper contains 11 sections, 9 equations, 4 figures, 4 tables.

Figures (4)

  • Figure 1: (a) The RoboBERT architecture consists of language connectors, a modality fusion layer, and a diffusion head, responsible for sentence understanding, modality integration, and action generation, respectively. The last layer of the ViT is unfrozen during training to adapt to the task. (b) The policy workflow begins by taking observations from the last 1-2 frames, predicting actions over multiple frames, and then outputting actions for the near future. Afterward, new observations are taken, and the cycle repeats.
  • Figure 2: Illustration of the two-stage training. Red and blue blocks represent the active and frozen modules, respectively. (a) First-stage training: the model predicts the corresponding output from stable, simple linguistic labels. (b) Second-stage training: all parts are unfrozen and the model is trained on natural-language instructions.
  • Figure 3: Illustration of the data augmentations. (a) From left to right: the original image, followed by versions altered by salt-and-pepper noise, translation, and color jitter. (b) New data generated by mixing two RGB observations and their corresponding action vectors.
  • Figure 4: Examples from the real-robot experiments. From left to right: sequential table tasks, moving a pen to a pen holder, and stacking cubes.
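
The mixing augmentation in the Figure 3(b) caption resembles mixup: two samples are blended with one shared weight applied to both the images and the action vectors. The following is a minimal sketch under that assumption. The function names, the flattened-list image representation, and the Beta-distributed weight (a common mixup convention, with `alpha=0.4` chosen arbitrarily) are illustrative choices and are not taken from the paper.

```python
import random


def mixup_pair(rgb_a, act_a, rgb_b, act_b, lam):
    """Convexly blend two (observation, action) samples with weight `lam`.

    Images are given as flattened lists of floats; actions as lists of floats.
    The same weight is applied to both so the mixed label matches the mixed input.
    """
    if not 0.0 <= lam <= 1.0:
        raise ValueError("lam must lie in [0, 1]")
    if len(rgb_a) != len(rgb_b) or len(act_a) != len(act_b):
        raise ValueError("samples must have matching shapes")
    mixed_rgb = [lam * a + (1.0 - lam) * b for a, b in zip(rgb_a, rgb_b)]
    mixed_act = [lam * a + (1.0 - lam) * b for a, b in zip(act_a, act_b)]
    return mixed_rgb, mixed_act


def sample_lambda(rng, alpha=0.4):
    """Draw a mixing weight from Beta(alpha, alpha); alpha is an assumed value."""
    return rng.betavariate(alpha, alpha)


if __name__ == "__main__":
    rng = random.Random(0)
    lam = sample_lambda(rng)
    rgb, act = mixup_pair([0.0, 1.0], [2.0], [1.0, 0.0], [4.0], lam)
    print(lam, rgb, act)
```

Sharing one weight between image and action keeps each augmented pair consistent, since the target is the same convex combination as the input.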