RoboTron-Mani: All-in-One Multimodal Large Model for Robotic Manipulation

Feng Yan, Fanfan Liu, Liming Zheng, Yufeng Zhong, Yiyang Huang, Zechao Guan, Chengjian Feng, Lin Ma

TL;DR

RoboTron-Mani introduces an all-in-one multimodal large model for robotic manipulation, integrating 3D perception, multimodal outputs, and cross-dataset training. The RoboData dataset unifies diverse robotic datasets with 3D alignment and standardized action representations, enabling joint learning across multiple embodiments. Empirical results show RoboTron-Mani achieving competitive or superior performance across simulated and real-world tasks, with strong cross-embodiment generalization and clear gains from 3D perception and modality isolation. This work proposes a scalable framework and evaluation standard that facilitates cross-domain embodied AI research and practical deployment in complex 3D environments.

Abstract

Recently, robotics has advanced significantly through the integration of larger models and large-scale datasets. However, challenges remain in applying these models to 3D spatial interactions and in managing data collection costs. To address these issues, we propose the multimodal robotic manipulation model RoboTron-Mani and the comprehensive dataset RoboData. On one hand, RoboTron-Mani enhances 3D perception through camera parameters and occupancy supervision. On the other hand, it incorporates a Modality-Isolation-Mask and multimodal decoder blocks based on OpenFlamingo, improving modality fusion and fine-grained perception. RoboData integrates several publicly available datasets, achieving the first fusion of multi-view images, camera parameters, depth maps, actions, and space alignment, which facilitates comprehensive learning from diverse robotic datasets and offers a complete evaluation system. Trained on RoboData, RoboTron-Mani is the first generalist policy that surpasses expert models, enabling simultaneous evaluation of all tasks across multiple datasets rather than being limited to specific data or task selections. Specifically, RoboTron-Mani boosts manipulation performance by increasing the average sequence length on CALVIN from 1.7 to 3.5, enabling cross-embodiment generalization, and achieving state-of-the-art results on both simulated and real-world datasets.
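
For readers unfamiliar with the CALVIN metric quoted above: the benchmark's long-horizon evaluation is commonly reported as the mean number of instructions a policy completes in a row within a chain of five. The sketch below illustrates only that arithmetic. The function name and the rollout data are hypothetical illustrations and are not taken from the paper or from the CALVIN codebase.

```python
from typing import Sequence


def average_sequence_length(rollouts: Sequence[Sequence[bool]]) -> float:
    """Mean number of consecutive subtasks completed before the first failure.

    Each rollout is a list of per-subtask success flags in execution order.
    Successes that occur after a failure do not count (assumed convention).
    """
    if not rollouts:
        raise ValueError("need at least one rollout")
    total = 0
    for outcomes in rollouts:
        completed = 0
        for success in outcomes:
            if not success:
                break
            completed += 1
        total += completed
    return total / len(rollouts)


# Made-up illustration data: four chains of five subtasks each.
rollouts = [
    [True, True, True, True, True],       # 5 in a row
    [True, True, True, False, False],     # 3 in a row
    [True, True, False, True, True],      # 2 in a row; later successes ignored
    [False, False, False, False, False],  # 0 in a row
]
print(average_sequence_length(rollouts))  # (5 + 3 + 2 + 0) / 4 = 2.5
```

Under this reading, a score of 3.5 means that on average three to four of the five chained instructions are completed before the first failure.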

Paper Structure

This paper contains 31 sections, 17 equations, 9 figures, 11 tables.

Figures (9)

  • Figure 1: Stacked bar chart depicting the performance of various models across five datasets. The "SOTA" label represents the best results achieved by specialized models for each dataset. Notably, RoboTron-Mani is the only generalist policy evaluated on multiple datasets, demonstrating competitive performance relative to "SOTA". For more details, please refer to the state-of-the-art comparison section (label sec:sota).
  • Figure 2: Left: RoboData integrates multiple diverse and complex datasets (CALVIN [mees2022calvin], Meta-World [yu2020meta], LIBERO [liu2024libero], RT-1 [brohan2022rt], RoboCAS [zheng2024robocas], ManiSkill2 [gu2023maniskill2], RoboCasa [nasiriany2024robocasa], RLBench [james2020rlbench], and Colosseum [pumacay2024colosseum]), covering various robot embodiments, environments, and task types, resulting in a unified dataset with standardized input and output spaces. Right: RoboTron-Mani features comprehensive 3D perception capabilities and flexible multimodal outputs, and significantly enhances generalization in robotic manipulation.
  • Figure 3: Architecture of RoboTron-Mani. Vision Encoder extracts multi-view features. 3D Perception Adapter leverages occupancy supervision to unify features and enhance spatial perception. Feature Fusion Decoder based on LLMs merges text and visual information. Multimodal Decoders enhance fine-grained perception and understanding through multimodal outputs.
  • Figure 4: Overview of the multimodal decoders. (a) Image Decoder, (b) Occupancy Decoder, and (c) Action Decoder. Each decoder processes input features through a series of Multi-Layer Perceptrons (MLPs), attention mechanisms, and convolutional neural networks (CNNs) to generate appropriate output representations.
  • Figure 5: Left: Modality Isolation Mask (MIM). The KQ mask structure regulates attention interactions among different modalities (e.g., <text>, <image>, <action>). Dark squares indicate allowed attention connections between keys (K) and queries (Q), while white squares denote prohibited attention, ensuring modality isolation. Right: Frequency of Tasks. This panel shows the distribution of tasks within the dataset as the number of episodes associated with each task (y-axis). The bars cover tasks such as "place," "pick," and "turn," highlighting the diversity and focus areas of the dataset.
  • ...and 4 more figures
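
The Modality Isolation Mask in the Figure 5 caption can be pictured as a boolean query-by-key matrix in which an entry is set only when the query token's modality is allowed to attend to the key token's modality. The following is a minimal NumPy sketch of that idea. The caption does not state which modality pairs are permitted, so the `ALLOWED` table is an assumption chosen for illustration, and the function names are hypothetical rather than taken from the authors' code.

```python
import numpy as np

# ASSUMPTION: the permitted (query modality -> key modalities) pairs below are
# illustrative only; the source does not specify the actual pattern.
ALLOWED = {
    "text": {"text"},
    "image": {"text", "image"},
    "action": {"text", "image", "action"},
    "occupancy": {"text", "image", "occupancy"},
}


def modality_isolation_mask(token_modalities, allowed=ALLOWED):
    """Return a boolean mask; mask[q, k] is True when query q may attend to key k."""
    n = len(token_modalities)
    mask = np.zeros((n, n), dtype=bool)
    for q, q_mod in enumerate(token_modalities):
        for k, k_mod in enumerate(token_modalities):
            mask[q, k] = k_mod in allowed[q_mod]
    return mask


def masked_softmax(scores, mask):
    """Row-wise softmax over keys, with prohibited positions forced to zero weight."""
    scores = np.where(mask, scores, -np.inf)
    # Every row keeps at least its own modality, so the row maximum is finite.
    scores = scores - scores.max(axis=-1, keepdims=True)
    weights = np.exp(scores)
    return weights / weights.sum(axis=-1, keepdims=True)


tokens = ["text", "text", "image", "action", "occupancy"]
mask = modality_isolation_mask(tokens)
rng = np.random.default_rng(0)
weights = masked_softmax(rng.normal(size=(len(tokens), len(tokens))), mask)
print(mask.astype(int))
print(weights.round(3))
```

With this assumed table, the action token and the occupancy token never exchange attention weight, while both still read from text and image tokens. Swapping in a different `ALLOWED` table changes the isolation pattern without touching the rest of the code.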