RoboBrain: A Unified Brain Model for Robotic Manipulation from Abstract to Concrete

Yuheng Ji, Huajie Tan, Jiayu Shi, Xiaoshuai Hao, Yuan Zhang, Hengyuan Zhang, Pengwei Wang, Mengdi Zhao, Yao Mu, Pengju An, Xinda Xue, Qinghang Su, Huaihai Lyu, Xiaolong Zheng, Jiaming Liu, Zhongyuan Wang, Shanghang Zhang

TL;DR

RoboBrain addresses core limitations of Multimodal Large Language Models in robotics by introducing ShareRobot, a fine-grained, multi-dimensional dataset for task planning, affordance, and trajectory, and a three-module RoboBrain model (planning, affordance, trajectory). Through a two-phase training regime that blends large-scale general multimodal data with robotic-specific data and long videos, RoboBrain achieves state-of-the-art results on RoboVQA, OpenEQA, and ShareRobot benchmarks, while also delivering concrete affordance regions and manipulation trajectories. The work demonstrates that carefully curated, high-quality data and targeted LoRA-based specialization can significantly enhance long-horizon robotic manipulation capabilities, with potential impact on real-world autonomous manipulation tasks. Overall, ShareRobot and RoboBrain validate a scalable path toward a unified robotic brain capable of translating abstract instructions into concrete, executable actions.

Abstract

Recent advancements in Multimodal Large Language Models (MLLMs) have shown remarkable capabilities across various multimodal contexts. However, their application in robotic scenarios, particularly for long-horizon manipulation tasks, reveals significant limitations. These limitations arise from the current MLLMs lacking three essential robotic brain capabilities: Planning Capability, which involves decomposing complex manipulation instructions into manageable sub-tasks; Affordance Perception, the ability to recognize and interpret the affordances of interactive objects; and Trajectory Prediction, the foresight to anticipate the complete manipulation trajectory necessary for successful execution. To enhance the robotic brain's core capabilities from abstract to concrete, we introduce ShareRobot, a high-quality heterogeneous dataset that labels multi-dimensional information such as task planning, object affordance, and end-effector trajectory. ShareRobot's diversity and accuracy have been meticulously refined by three human annotators. Building on this dataset, we developed RoboBrain, an MLLM-based model that combines robotic and general multi-modal data, utilizes a multi-stage training strategy, and incorporates long videos and high-resolution images to improve its robotic manipulation capabilities. Extensive experiments demonstrate that RoboBrain achieves state-of-the-art performance across various robotic tasks, highlighting its potential to advance robotic brain capabilities.

Paper Structure

This paper contains 35 sections, 12 figures, 9 tables.

Figures (12)

  • Figure 1: Overview of RoboBrain. RoboBrain consists of three key robotic capabilities: planning capability, affordance perception, and trajectory prediction. RoboBrain outperforms previous MLLMs in robotics tasks. The bottom part shows the composition of RoboBrain's training data and provides a specific example of visual question answering from our proposed ShareRobot. Best viewed on screen.
  • Figure 2: The generation process of our ShareRobot dataset. Our dataset labels multi-dimensional information, including task planning, object affordance, and end-effector trajectories. Task planning is first annotated as atomic tasks and then augmented by constructing question-answer pairs. The affordance and trajectory are labeled on the images according to the specific instructions.
  • Figure 3: The diversity of our ShareRobot dataset. Our dataset involves (a) 23 original datasets, (b) 12 embodiments and (c) 107 types of atomic tasks. The distribution of the top 20 most frequent atomic actions within our ShareRobot dataset is presented in (c).
  • Figure 4: The pipeline of our RoboBrain. Single images, multi-image inputs, and videos are fed into our model to pre-train a foundational robotic brain. We then fine-tune RoboBrain via A-LoRA and T-LoRA to develop affordance and trajectory skills. In practical applications, the model first generates a detailed plan and then splits it into sub-task descriptions to execute specific robotic tasks.
  • Figure 5: The performance of our model RoboBrain on the OpenEQA, ShareRobot, and RoboVQA benchmarks. RoboBrain surpassed all baseline models, achieving state-of-the-art results.
  • ...and 7 more figures
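
The inference flow described in the Figure 4 caption (generate a plan, split it into sub-task descriptions, then resolve each sub-task into an affordance region and a trajectory) can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: `split_plan`, `run_pipeline`, `SubTaskResult`, and the three callables standing in for the planner, the A-LoRA head, and the T-LoRA head are all hypothetical names, and the plan format (one step per line, optionally numbered) is assumed rather than taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List, Tuple

# All names here are hypothetical stand-ins, not the authors' API.
Point = Tuple[float, float]
Box = Tuple[float, float, float, float]


@dataclass
class SubTaskResult:
    sub_task: str
    affordance: Box          # assumed format: (x1, y1, x2, y2) region to interact with
    trajectory: List[Point]  # assumed format: 2D end-effector waypoints


def split_plan(plan: str) -> List[str]:
    """Split a plan with one step per line into sub-task strings."""
    steps = []
    for line in plan.splitlines():
        line = line.strip()
        if not line:
            continue
        # Drop a leading "1." or "2)" style index if present.
        head, sep, tail = line.partition(" ")
        if sep and head.rstrip(".)").isdigit():
            line = tail.strip()
        steps.append(line)
    return steps


def run_pipeline(
    observation: object,
    instruction: str,
    planner: Callable[[object, str], str],
    affordance_head: Callable[[object, str], Box],
    trajectory_head: Callable[[object, str], List[Point]],
) -> List[SubTaskResult]:
    """Abstract-to-concrete flow: plan, split, then ground each sub-task."""
    plan = planner(observation, instruction)
    results = []
    for sub_task in split_plan(plan):
        results.append(
            SubTaskResult(
                sub_task=sub_task,
                affordance=affordance_head(observation, sub_task),
                trajectory=trajectory_head(observation, sub_task),
            )
        )
    return results
```

Passing the three model components in as plain callables keeps the sketch independent of any particular model-loading code; in practice each callable would wrap the base model with the corresponding adapter enabled.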