DM0: An Embodied-Native Vision-Language-Action Model towards Physical AI

En Yu, Haoran Lv, Jianjian Sun, Kangheng Lin, Ruitao Zhang, Yukang Shi, Yuyang Chen, Ze Chen, Ziheng Zhang, Fan Jia, Kaixin Liu, Meng Zhang, Ruitao Hao, Saike Huang, Songhan Xie, Yu Liu, Zhao Wu, Bin Xie, Pengwei Zhang, Qi Yang, Xianchi Deng, Yunfei Wei, Enwen Zhang, Hongyang Peng, Jie Zhao, Kai Liu, Wei Sun, Yajun Wei, Yi Yang, Yunqiao Zhang, Ziwei Yan, Haitao Yang, Hao Liu, Haoqiang Fan, Haowei Zhang, Junwen Huang, Yang Chen, Yunchao Ma, Yunhuan Yang, Zhengyuan Du, Ziming Liu, Jiahui Niu, Yucheng Zhao, Daxin Jiang, Wenbin Tang, Xiangyu Zhang, Zheng Ge, Erjin Zhou, Tiancai Wang

TL;DR

DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI, unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset.

Abstract

Moving beyond the traditional paradigm of adapting internet-pretrained models to physical tasks, we present DM0, an Embodied-Native Vision-Language-Action (VLA) framework designed for Physical AI. Unlike approaches that treat physical grounding as a fine-tuning afterthought, DM0 unifies embodied manipulation and navigation by learning from heterogeneous data sources from the outset. Our methodology follows a three-stage pipeline: Pretraining, Mid-Training, and Post-Training. First, we conduct large-scale unified pretraining of the Vision-Language Model (VLM) on diverse corpora (web text, autonomous driving scenarios, and embodied interaction logs) to jointly acquire semantic knowledge and physical priors. Subsequently, we build a flow-matching action expert atop the VLM. To reconcile high-level reasoning with low-level control, DM0 employs a hybrid training strategy: for embodied data, gradients from the action expert are not backpropagated to the VLM, which preserves its generalized representations, while the VLM remains trainable on non-embodied data. Furthermore, we introduce an Embodied Spatial Scaffolding strategy to construct spatial Chain-of-Thought (CoT) reasoning, which constrains the action solution space. Experiments on the RoboChallenge benchmark demonstrate that DM0 achieves state-of-the-art performance in both Specialist and Generalist settings on Table30.
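The hybrid training strategy described in the abstract can be pictured with a toy scalar model. The following is a minimal sketch, not the authors' implementation: the function name `hybrid_grads`, the scalar weights, and the squared-error losses are all hypothetical stand-ins. It only illustrates the routing rule, namely that the action loss on embodied samples updates the action expert while the VLM feature is treated as a constant, and that non-embodied samples update the VLM. Whether the VLM also receives other losses on embodied data is not specified in this section and is not modeled here.

```python
def hybrid_grads(w_vlm, w_act, x, target, is_embodied):
    """Return (grad_w_vlm, grad_w_act) for one sample of a toy scalar model.

    Hypothetical stand-ins (assumptions, not from the paper):
      h    = w_vlm * x   plays the role of the VLM representation
      pred = w_act * h   plays the role of the action expert output
    Both losses are squared errors, chosen only for easy hand-derived gradients.
    """
    h = w_vlm * x
    if is_embodied:
        err = w_act * h - target
        # The action loss updates the action expert as usual.
        grad_act = 2.0 * err * h
        # Stop-gradient: h is treated as a constant, so the action loss
        # contributes nothing to the VLM weight. Without the stop-gradient
        # this term would be 2.0 * err * w_act * x.
        grad_vlm = 0.0
    else:
        # Non-embodied sample: the VLM is trained on its own loss.
        err = h - target
        grad_vlm = 2.0 * err * x
        grad_act = 0.0
    return grad_vlm, grad_act


# Embodied sample: only the action expert receives a gradient.
embodied = hybrid_grads(w_vlm=0.5, w_act=2.0, x=1.0, target=3.0, is_embodied=True)
# Non-embodied sample: only the VLM receives a gradient.
non_embodied = hybrid_grads(w_vlm=0.5, w_act=2.0, x=1.0, target=1.0, is_embodied=False)
```

In an autograd framework the same effect is usually obtained by detaching the VLM features before they enter the action expert on embodied batches; the manual gradients above just make the rule explicit.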

Paper Structure (38 sections, 4 equations, 4 figures, 6 tables)

This paper contains 38 sections, 4 equations, 4 figures, 6 tables.

Figures (4)

  • Figure 1: DM0 is pretrained on diverse corpora that integrate web, autonomous driving, and embodied data. It jointly acquires semantic knowledge and physical priors.
  • Figure 2: Model Architecture. DM0 consists of a vision-language model (VLM) backbone and an action expert based on Flow Matching (Lipman et al., 2022). The VLM processes multi-modal inputs and generates embodied reasoning representations, which are subsequently consumed by the action expert to produce continuous robot actions.
  • Figure 3: Overview of Curated Vision-Language Data. The curated dataset is designed to enhance core embodied reasoning abilities while preserving the general multimodal understanding and reasoning capabilities of the pretrained VLM.
  • Figure 4: Data mixture ratios for pretraining, mid-training, and post-training. After weighted sampling, the total data volumes for the three stages are 1.13T tokens (pretraining), 200M samples (mid-training), and 50M samples (post-training).
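As an illustration of how a Flow Matching action expert (Figure 2) can produce continuous actions, here is a minimal sketch under stated assumptions: it uses the common linear probability path with noise at t = 0 and the action at t = 1, and plain Euler integration at inference. The paper's exact path, time convention, and solver are not given in this section, and the names `fm_training_pair`, `euler_sample`, and `oracle_velocity` are hypothetical. The oracle velocity field stands in for a trained network so that the example runs on its own.

```python
def fm_training_pair(action, noise, t):
    """Build one training example for linear-path flow matching.

    x_t      = (1 - t) * noise + t * action   (point on the straight path)
    v_target = action - noise                 (constant velocity along it)
    A network v(x_t, t, context) would be regressed onto v_target.
    """
    x_t = [(1.0 - t) * n + t * a for n, a in zip(noise, action)]
    v_target = [a - n for n, a in zip(noise, action)]
    return x_t, v_target


def euler_sample(velocity_fn, noise, steps=10):
    """Integrate dx/dt = v(x, t) from t = 0 to t = 1 with fixed Euler steps."""
    x = list(noise)
    dt = 1.0 / steps
    for k in range(steps):
        t = k * dt
        v = velocity_fn(x, t)
        x = [xi + dt * vi for xi, vi in zip(x, v)]
    return x


# Toy setup: a 2-D "action" and a fixed noise sample (values are arbitrary).
action = [0.3, -0.7]
noise = [1.0, 0.5]


def oracle_velocity(x, t):
    # Stand-in for a trained network: the exact field for a single target
    # action on the linear path, v(x, t) = (action - x) / (1 - t).
    return [(a - xi) / (1.0 - t) for a, xi in zip(action, x)]


x_half, v_half = fm_training_pair(action, noise, t=0.5)
sampled = euler_sample(oracle_velocity, noise, steps=10)
```

With the oracle field the velocity is constant along the path, so Euler integration lands on the target action up to floating-point error; a learned field would only approximate this.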