Xiaomi-Robotics-0: An Open-Sourced Vision-Language-Action Model with Real-Time Execution

Rui Cai, Jun Guo, Xinze He, Piaopiao Jin, Jie Li, Bingxuan Lin, Futeng Liu, Wei Liu, Fei Ma, Kun Ma, Feng Qiu, Heng Qu, Yifei Su, Qiao Sun, Dong Wang, Donghao Wang, Yunhong Wang, Rujie Wu, Diyun Xiang, Yu Yang, Hangjun Ye, Yuan Zhang, Quanyun Zhou

TL;DR

Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM.

Abstract

In this report, we introduce Xiaomi-Robotics-0, an advanced vision-language-action (VLA) model optimized for high performance and for fast, smooth real-time execution. The key to our method lies in a carefully designed training recipe and deployment strategy. Xiaomi-Robotics-0 is first pre-trained on large-scale cross-embodiment robot trajectories and vision-language data, endowing it with broad and generalizable action-generation capabilities while avoiding catastrophic forgetting of the visual-semantic knowledge of the underlying pre-trained VLM. During post-training, we propose several techniques for training the VLA model for asynchronous execution to address the inference latency during real-robot rollouts. During deployment, we carefully align the timesteps of consecutive predicted action chunks to ensure continuous and seamless real-time rollouts. We evaluate Xiaomi-Robotics-0 extensively in simulation benchmarks and on two challenging real-robot tasks that require precise and dexterous bimanual manipulation. Results show that our method achieves state-of-the-art performance across all simulation benchmarks. Moreover, Xiaomi-Robotics-0 rolls out quickly and smoothly on real robots using a consumer-grade GPU, achieving high success rates and throughput on both real-robot tasks. To facilitate future research, code and model checkpoints are open-sourced at https://xiaomi-robotics-0.github.io

Paper Structure (21 sections, 1 equation, 9 figures, 6 tables)

This paper contains 21 sections, 1 equation, 9 figures, 6 tables.

Figures (9)

  • Figure 1: Overview. Xiaomi-Robotics-0 achieves state-of-the-art performance in three widely-used simulation benchmarks. It also attains high throughput on two challenging real-robot bimanual manipulation tasks. Furthermore, it matches the underlying pre-trained VLM on several VLM benchmarks.
  • Figure 2: Data. Xiaomi-Robotics-0 leverages both robot trajectory data and vision-language (VL) data during pre-training.
  • Figure 3: Model & Training. (a) During the first step of pre-training, we train the VLM on both vision-language data (left) and robot trajectory data (right). Vision-language data are trained via a next-token-prediction objective. We adopt the training paradigm in Choice Policies [qi2025coordinated] to train the VLM for action prediction on the robot trajectory data. (b) In the second step of pre-training, we freeze the VLM and train the diffusion transformer to generate actions via flow-matching. (c) During post-training for asynchronous execution, we prepend a clean action prefix to the noisy action tokens.
  • Figure 4: The $\Lambda$-Shape Attention Mask for Post-Training. A noisy action token can only attend to the vision and language tokens via the VLM KV cache, the sink token, the state token, and the action tokens of the previous $w$ timesteps. The number in each token indicates the RoPE positional index of the token. Note that we add an offset of 10 to the positional indices of the noisy action tokens to allow the model to distinguish them from the clean action prefix tokens.
  • Figure 5: Asynchronous Execution. We show two consecutive chunks and how they are stitched together during robot rollout. See the deployment section for more details.
  • ...and 4 more figures
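
The attention rule in the Figure 4 caption can be illustrated with a small sketch. This is not the authors' implementation: the token layout (vision-language tokens, then sink, state, clean action prefix, and noisy action tokens), the function names `lambda_mask_rows` and `rope_positions`, and the choice to let each noisy token also attend to itself are all assumptions made for illustration. The only details taken from the caption are the set of attendable tokens, the window of $w$ previous action timesteps, and the positional offset of 10 for noisy action tokens.

```python
def lambda_mask_rows(n_vl, n_prefix, n_noisy, w):
    """Build one boolean attention row per noisy action token.

    Assumed (hypothetical) key layout:
        [VL tokens | sink | state | clean action prefix | noisy actions]
    True means the noisy action token may attend to that key.
    """
    sink = n_vl                  # index of the sink token
    state = n_vl + 1             # index of the state token
    first_action = n_vl + 2      # index of the first action token (clean prefix)
    n_keys = first_action + n_prefix + n_noisy

    rows = []
    for i in range(n_noisy):
        q = first_action + n_prefix + i   # key index of this noisy token
        row = [False] * n_keys
        for k in range(n_vl):             # all vision and language tokens
            row[k] = True
        row[sink] = True
        row[state] = True
        # Action tokens of the previous w timesteps; attending to the token
        # itself is an assumption, the caption only names the previous ones.
        lo = max(first_action, q - w)
        for k in range(lo, q + 1):
            row[k] = True
        rows.append(row)
    return rows


def rope_positions(start, n_prefix, n_noisy, offset=10):
    """Positional indices for clean prefix and noisy action tokens.

    Assumes positions are otherwise consecutive from `start`; only the
    offset of 10 for noisy tokens comes from the caption.
    """
    clean = [start + i for i in range(n_prefix)]
    noisy = [start + n_prefix + offset + i for i in range(n_noisy)]
    return clean, noisy


if __name__ == "__main__":
    # Render the mask: '#' = may attend, '.' = masked out.
    for row in lambda_mask_rows(n_vl=3, n_prefix=2, n_noisy=4, w=2):
        print("".join("#" if allowed else "." for allowed in row))
    print(rope_positions(start=0, n_prefix=2, n_noisy=3))
```

Printed row by row, the always-visible block on the left (vision-language, sink, and state tokens) together with the sliding band over recent action tokens gives the two arms of the $\Lambda$ shape. In this sketch, later noisy tokens lose sight of the earliest clean prefix tokens once those fall outside the window of $w$ steps.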