MiMo-Embodied: X-Embodied Foundation Model Technical Report

Xiaoshuai Hao, Lei Zhou, Zhijian Huang, Zhiwen Hou, Yingbo Tang, Lingfeng Zhang, Guang Li, Zheng Lu, Shuhuai Ren, Xianhui Meng, Yuchen Zhang, Jing Wu, Jinghui Lu, Chenxu Dang, Jiayi Guan, Jianhua Wu, Zhiyi Hou, Hanbing Li, Shumeng Xia, Mingliang Zhou, Yinan Zheng, Zihao Yue, Shuhao Gu, Hao Tian, Yuannan Shen, Jianwei Cui, Wen Zhang, Shaoqing Xu, Bing Wang, Haiyang Sun, Zeyu Zhu, Yuncheng Jiang, Zibin Guo, Chuhong Gong, Chaofan Zhang, Wenbo Ding, Kun Ma, Guang Chen, Rui Cai, Diyun Xiang, Heng Qu, Fuli Luo, Hangjun Ye, Long Chen

TL;DR

MiMo-Embodied addresses the need for a unified multimodal model that handles both embodied AI and autonomous driving. It is trained with a four-stage pipeline on general, embodied AI, and autonomous driving data, followed by chain-of-thought fine-tuning and GRPO-based reinforcement learning, to enable cross-domain transfer. The architecture combines a Vision Transformer encoder, a latent-space projector, and a Large Language Model, and is initialized from MiMo-VL, supporting perception, reasoning, and planning across tasks. Evaluation across 29 benchmarks (17 for embodied AI, 12 for autonomous driving) shows state-of-the-art or near-state-of-the-art performance in both domains, indicating practical potential for integrated robotic perception and decision-making.
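
As a rough illustration of the GRPO step mentioned above: GRPO, as commonly described, samples a group of responses per prompt and scores each one relative to the others in its group instead of using a learned value function. The sketch below shows only that group-relative advantage. The function name, the use of the population standard deviation, and the `eps` term are assumptions for illustration; the reward design used for MiMo-Embodied is not specified in this summary.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize each reward against the other samples in its group.

    `rewards` holds one scalar reward per sampled response to the same prompt.
    Using the population standard deviation and adding `eps` are assumptions
    made here to keep the sketch simple and avoid division by zero.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


# Hypothetical group of four responses, two judged correct (reward 1.0).
advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
```

Responses that score above their group's mean receive a positive advantage and those below receive a negative one; when every response in a group earns the same reward, all advantages are zero and the group contributes no learning signal.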

Abstract

We open-source MiMo-Embodied, the first cross-embodied foundation model to successfully integrate Autonomous Driving and Embodied AI and achieve state-of-the-art performance in both. MiMo-Embodied sets new records across 17 embodied AI benchmarks in Task Planning, Affordance Prediction, and Spatial Understanding, while also excelling in 12 autonomous driving benchmarks across Environmental Perception, Status Prediction, and Driving Planning. Across these tasks, MiMo-Embodied significantly outperforms existing open-source, closed-source, and specialized baselines. Our results indicate that, through multi-stage learning, curated data construction, and CoT/RL fine-tuning, these two domains exhibit strong positive transfer and mutually reinforce one another. We provide a detailed analysis of our model design and training methodologies to facilitate further research. Code and models are available at https://github.com/XiaomiMiMo/MiMo-Embodied.

Paper Structure

This paper contains 40 sections, 49 figures, and 8 tables.

Figures (49)

  • Figure 1: Performance Comparison in Autonomous Driving and Embodied AI Benchmarks. MiMo-Embodied achieves state-of-the-art performance in both benchmark suites, surpassing previous open-source, closed-source, and specialized VLMs across a range of autonomous driving and embodied AI tasks.
  • Figure 2: Overview of MiMo-Embodied Capabilities. MiMo-Embodied supports both Autonomous Driving and Embodied AI tasks. It is evaluated on 12 Autonomous Driving benchmarks covering Environmental Perception, Status Prediction, and Driving Planning, and on 17 Embodied AI benchmarks covering Affordance Prediction, Task Planning, and Spatial Understanding.
  • Figure 3: Model architecture of MiMo-Embodied. The architecture is designed for embodied AI and autonomous driving tasks and processes single images, multiple images, and videos. It consists of three main components: (1) a Vision Transformer for encoding visual inputs; (2) a projector that maps visual encodings to a latent space aligned with an LLM; and (3) the LLM itself for textual understanding and reasoning.
  • Figure 4: Overview of the Training Data used by MiMo-Embodied. The training data comprises three core components: the General Dataset establishes foundational capabilities; the Embodied AI Dataset strengthens affordance prediction, planning, and spatial perception; and the Autonomous Driving Dataset strengthens perception, prediction, and planning for autonomous driving.
  • Figure 5: Results of deploying MiMo-Embodied to downstream embodied navigation tasks. The target positions are indicated by cyan points.
  • ...and 44 more figures
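
The three-component architecture in the Figure 3 caption (Vision Transformer, projector, LLM) can be pictured as a data flow in which visual features are projected into the LLM's embedding width and joined with the text embeddings. The following is a minimal shape-level sketch, not the released implementation: the function names, the toy dimensions, the single linear layer standing in for the projector, and the choice to place visual tokens before text tokens are all assumptions made for illustration.

```python
import random


def rand_matrix(rows, cols, rng):
    """Build a rows x cols matrix of small random values (stand-in for real features or weights)."""
    return [[rng.uniform(-0.1, 0.1) for _ in range(cols)] for _ in range(rows)]


def linear(x, w):
    """Multiply row vectors x (n x d_in) by a weight matrix w (d_in x d_out)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*w)] for row in x]


def build_llm_input(patch_features, text_embeddings, projector_weights):
    """Project visual features to the LLM width and prepend them to the text embeddings.

    A single linear map is an assumed stand-in for the projector, and
    visual-before-text ordering is likewise an assumption.
    """
    visual_tokens = linear(patch_features, projector_weights)
    return visual_tokens + text_embeddings


rng = random.Random(0)
vit_dim, llm_dim = 8, 16                    # toy sizes, not the real model's
patches = rand_matrix(4, vit_dim, rng)      # 4 patch features from the vision encoder
prompt = rand_matrix(3, llm_dim, rng)       # 3 text token embeddings
projector = rand_matrix(vit_dim, llm_dim, rng)

sequence = build_llm_input(patches, prompt, projector)
```

Here `sequence` has seven rows (four projected visual tokens plus three text tokens), each of width `llm_dim`, which is the form of input a decoder-only LLM would consume in this kind of design.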