RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation
Feng Yan, Fanfan Liu, Liming Zheng, Yufeng Zhong, Yiyang Huang, Zechao Guan, Chengjian Feng, Lin Ma
TL;DR
RoboTron-Mani introduces an all-in-one multimodal large model for robotic manipulation, integrating 3D perception, multimodal outputs, and cross-dataset training. The RoboData dataset unifies diverse robotic datasets with 3D alignment and standardized action representations, enabling joint learning across multiple embodiments. Empirical results show RoboTron-Mani achieving competitive or superior performance across simulated and real-world tasks, with strong cross-embodiment generalization and clear gains from 3D perception and modality isolation. This work proposes a scalable framework and evaluation standard that facilitates cross-domain embodied AI research and practical deployment in complex 3D environments.
Abstract
Recently, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and in managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model RoboTron-Mani and the comprehensive dataset RoboData. RoboTron-Mani enhances 3D perception through camera parameters and occupancy supervision. Building on OpenFlamingo, it also incorporates a Modality-Isolation-Mask and multimodal decoder blocks, improving modality fusion and fine-grained perception. RoboData integrates several publicly available datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, actions, and spatial alignment, which facilitates comprehensive learning from diverse robotic datasets and offers a complete evaluation system. Trained on RoboData, RoboTron-Mani is the first generalist policy that surpasses expert models, enabling simultaneous evaluation of all tasks across multiple datasets, rather than being limited to specific data or task selections. Specifically, RoboTron-Mani boosts manipulation performance by increasing the average sequence length on CALVIN from 1.7 to 3.5, enabling cross-embodiment generalization, and achieving state-of-the-art results on both simulated and real-world datasets.
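To make the idea of aligning actions from different datasets into one shared space more concrete, the following is a minimal sketch and not the RoboData implementation. It assumes, purely for illustration, that each dataset expresses translation deltas in its own coordinate frame and that a fixed 3x3 rotation maps that frame into a shared one. The names `rotate`, `align_action`, and `ROT_Z_90` are hypothetical and do not come from the paper.

```python
from typing import List, Sequence

Vector = Sequence[float]
Matrix = Sequence[Sequence[float]]


def rotate(rotation: Matrix, vector: Vector) -> List[float]:
    """Multiply a 3x3 rotation matrix by a 3-vector."""
    return [sum(r * v for r, v in zip(row, vector)) for row in rotation]


def align_action(delta_xyz: Vector, gripper: float,
                 dataset_to_shared: Matrix) -> List[float]:
    """Map a dataset-local translation delta into the shared frame.

    Assumption: the gripper command is frame-independent, so it is
    passed through unchanged and appended as the last action entry.
    """
    return rotate(dataset_to_shared, delta_xyz) + [gripper]


# Hypothetical dataset whose frame is rotated 90 degrees about the
# z axis relative to the shared frame (illustrative values only).
ROT_Z_90 = [
    [0.0, -1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, 1.0],
]

# A move along the dataset's x axis becomes a move along the shared y axis.
aligned = align_action([0.1, 0.0, 0.05], 1.0, ROT_Z_90)
print(aligned)
```

Under these assumptions, actions from every dataset end up in the same four-element layout (x, y, z, gripper) and the same frame, which is the kind of standardization that allows a single policy to be trained jointly on all of them. Rotational action components and camera extrinsics would need the same treatment but are omitted here for brevity.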
