RoboBERT: An End-to-end Multimodal Robotic Manipulation Model
Sicheng Wang, Sheng Liu, Weiheng Wang, Jianhua Shan, Bin Fang
TL;DR
RoboBERT tackles data efficiency in multimodal robotic manipulation by decoupling policy learning from linguistic generalization via a two-stage training paradigm. It uses a diffusion-based policy head, a BERT-based language connector, and a CLIP-ViT visual backbone to fuse language and vision into end-to-end control, trained with Behavioral Cloning on language-labeled demonstrations. The approach achieves state-of-the-art mean episode lengths on CALVIN ABCD-D and ABC-D benchmarks (4.52 and 3.79) without relying on large robotics datasets, and shows superior real-robot success rates on a 6-DOF manipulator compared to comparable methods. These results suggest that data-augmentation-enhanced two-stage training is an efficient, scalable path for robust multimodal robotic manipulation.
Abstract
Embodied intelligence seamlessly integrates vision, language, and action. However, most multimodal robotic models rely on massive fine-tuning, incurring high time and hardware costs. To address this, we introduce RoboBERT, an end-to-end multimodal manipulation model built around a novel two-stage training paradigm. In the first stage, we freeze most of the vision encoder and train with a single "standard" instruction phrasing, allowing the model to focus on stable policy learning via a CNN-based diffusion policy. In the second stage, we unfreeze all modules and inject diverse natural language variants, rapidly aligning varied instructions to the already-learned policy without destabilizing performance. We further employ systematic data augmentations to enhance robustness against visual perturbations. Without relying on auxiliary datasets, RoboBERT achieves new state-of-the-art (SOTA) mean episode lengths of 4.52 on the CALVIN ABCD-D benchmark and 3.79 on the ABC-D benchmark using only language-labeled expert demonstrations and a comparatively lightweight architecture. Real-robot trials on a 6-DOF manipulator confirm higher success rates than comparable methods trained on identical data. These results demonstrate that our data-augmentation-enhanced two-stage training paradigm delivers efficient, scalable, and broadly applicable performance for multimodal robotic systems.
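The two-stage schedule described above can be illustrated with a minimal sketch. It is not the authors' implementation: the module names, the task name, and the example phrasings are all hypothetical, and stage one here freezes the entire vision encoder as a simplification, whereas the abstract says only that most of it is frozen.

```python
import random
from dataclasses import dataclass

# Hypothetical module names, chosen for illustration only.
MODULES = ("vision_encoder", "language_connector", "policy_head")

# Hypothetical task with one "standard" phrasing first, followed by variants.
PHRASINGS = {
    "open_drawer": [
        "open the drawer",            # standard phrasing used in stage 1
        "pull the drawer open",
        "slide the drawer out",
    ],
}


@dataclass(frozen=True)
class StageConfig:
    trainable: frozenset      # modules that receive gradient updates
    use_paraphrases: bool     # whether to sample varied instruction wordings


def stage_config(stage: int) -> StageConfig:
    """Return the training setup for stage 1 or stage 2."""
    if stage == 1:
        # Assumption: freeze the whole vision encoder (the source freezes "most" of it).
        return StageConfig(frozenset(MODULES) - {"vision_encoder"}, False)
    if stage == 2:
        # All modules are unfrozen and instruction wording is varied.
        return StageConfig(frozenset(MODULES), True)
    raise ValueError(f"unknown stage: {stage}")


def pick_instruction(task: str, cfg: StageConfig, rng: random.Random) -> str:
    """Choose the instruction text paired with a demonstration of `task`."""
    variants = PHRASINGS[task]
    return rng.choice(variants) if cfg.use_paraphrases else variants[0]


if __name__ == "__main__":
    rng = random.Random(0)
    for stage in (1, 2):
        cfg = stage_config(stage)
        print(stage, sorted(cfg.trainable), pick_instruction("open_drawer", cfg, rng))
```

The point of the split is that stage one sees a fixed wording per task, so the policy is learned against a stable language input, and only stage two exposes the model to varied wording.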
