EO-1: Interleaved Vision-Text-Action Pretraining for General Robot Control

Delin Qu, Haoming Song, Qizhi Chen, Zhaoqing Chen, Xianqiang Gao, Xinyi Ye, Qi Lv, Modi Shi, Guanghui Ren, Cheng Ruan, Maoqing Yao, Haoran Yang, Jiacheng Bao, Bin Zhao, Dong Wang

TL;DR

This work introduces EO-Robotics, which comprises a unified embodied foundation model (EO-1) and the interleaved vision-text-action dataset it is trained on (EO-Data1.5M), with the aim of human-like interleaved reasoning and action in open-world robotics. The approach combines autoregressive text decoding with flow-matching action denoising in a single decoder, enabling seamless perception, planning, and manipulation across multiple embodiments. Extensive benchmarks and real-world experiments show strong open-world generalization, dexterous multi-robot manipulation, and superior performance on embodied reasoning tasks, supported by an openly released dataset, model weights, and code. The EO-Bench benchmark provides a principled, disentangled evaluation suite for open-world embodied reasoning, illustrating the practical impact of unified multimodal pretraining for general-purpose robotic systems.

Abstract

The human ability to seamlessly perform multimodal reasoning and physical interaction in the open world is a core goal for general-purpose embodied intelligent systems. Recent vision-language-action (VLA) models, which are co-trained on large-scale robot and visual-text data, have demonstrated notable progress in general robot control. However, they still fail to achieve human-level flexibility in interleaved reasoning and interaction. In this work, we introduce EO-Robotics, which consists of the EO-1 model and the EO-Data1.5M dataset. EO-1 is a unified embodied foundation model that achieves superior performance in multimodal embodied reasoning and robot control through interleaved vision-text-action pre-training. The development of EO-1 is based on two key pillars: (i) a unified architecture that processes multimodal inputs indiscriminately (image, text, video, and action), and (ii) a massive, high-quality multimodal embodied reasoning dataset, EO-Data1.5M, which contains over 1.5 million samples with emphasis on interleaved vision-text-action comprehension. EO-1 is trained through synergies between auto-regressive decoding and flow matching denoising on EO-Data1.5M, enabling seamless robot action generation and multimodal embodied reasoning. Extensive experiments demonstrate the effectiveness of interleaved vision-text-action learning for open-world understanding and generalization, validated through a variety of long-horizon, dexterous manipulation tasks across multiple embodiments. This paper details the architecture of EO-1, the data construction strategy of EO-Data1.5M, and the training methodology, offering valuable insights for developing advanced embodied foundation models.
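To make the pairing of auto-regressive decoding and flow matching denoising concrete, the following is a minimal NumPy sketch of how a next-token cross-entropy term and a flow-matching regression term might be summed into one training loss. It is an illustration under stated assumptions, not the authors' implementation: the function names, the linear noise-to-action interpolation path, the `action_weight` parameter, and the flat `(batch, action_dim)` action shape are all hypothetical choices made here, and `velocity_fn` stands in for the model's action head.

```python
import numpy as np

def next_token_loss(logits, targets):
    """Mean cross-entropy over text positions.
    logits: (T, V) unnormalized scores, targets: (T,) integer token ids."""
    shifted = logits - logits.max(axis=-1, keepdims=True)  # numerical stability
    log_probs = shifted - np.log(np.exp(shifted).sum(axis=-1, keepdims=True))
    return -log_probs[np.arange(len(targets)), targets].mean()

def flow_matching_loss(velocity_fn, actions, rng):
    """Flow-matching regression loss on a straight noise-to-action path.
    ASSUMPTION: linear path x_t = (1 - t) * noise + t * action, whose
    velocity is (action - noise); the paper may use another convention."""
    noise = rng.standard_normal(actions.shape)
    t = rng.uniform(size=(actions.shape[0], 1))  # one time per sample
    x_t = (1.0 - t) * noise + t * actions        # noisy action at time t
    target_velocity = actions - noise
    predicted = velocity_fn(x_t, t)              # stand-in for the action head
    return np.mean((predicted - target_velocity) ** 2)

def combined_loss(logits, targets, velocity_fn, actions, rng, action_weight=1.0):
    """Sum of the text and action objectives (the weighting is hypothetical)."""
    text_term = next_token_loss(logits, targets)
    action_term = flow_matching_loss(velocity_fn, actions, rng)
    return text_term + action_weight * action_term
```

In a model of the kind described, both terms would be computed from the outputs of the same shared backbone, so a single backward pass updates the weights for reasoning and for action generation together.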

Paper Structure

This paper contains 36 sections, 2 equations, 32 figures, 6 tables.

Figures (32)

  • Figure 1: EO-1 Model Architecture. EO-1 is a Vision-Language-Action (VLA) model that adopts a single unified decoder-only transformer, equipped with a discrete language-modeling head for multimodal embodied reasoning and a continuous flow-matching head for robot action generation. The language instruction, image observations, robot state, and noisy action are encoded into an interleaved token sequence processed by the shared transformer backbone, whose weights are initialized from Qwen2.5-VL. The model is trained on interleaved vision-text-action data with a combination of a flow-matching objective and a next-token-prediction objective, and is capable of seamless embodied reasoning and acting.
  • Figure 2: Interleaved rectifying sampling strategy. Our method samples variable-length subsequences from robot action generation segments, enabling efficient training of mixed-modality generation while preserving causal relationships.
  • Figure 4: Examples of EO-Bench, including Multiview Pointing, Physical Common Sense, Trajectory Prediction, Process Verification, Task Planning, and Robot Affordance.
  • Figure 5: Examples of real-world evaluation tasks on diverse robots, including Agibot G-1 Long-horizon Dexterous (rows 1-4), Franka Panda Pick-and-Place (row 5), WidowX 250 S Out-of-Box (row 6), and Embodied Reasoning Control on Franka, Agibot G-1, and Lerobot SO100 (rows 1, 2, 7, 8).
  • Figure 6: Performance comparison on diverse robot platforms and task categories.
  • ...and 27 more figures
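
The flow-matching head named in the Figure 1 caption produces actions by denoising a noisy action. One common way to do that at inference time is to integrate a learned velocity field from pure noise to an action with a few Euler steps, sketched below. This is a generic illustration, not a description of EO-1's actual sampler: the function name, the step count, the time convention (noise at t = 0, action at t = 1), and `velocity_fn` as a stand-in for the model are all assumptions.

```python
import numpy as np

def denoise_actions(velocity_fn, shape, num_steps=10, rng=None):
    """Integrate dx/dt = velocity_fn(x, t) from t = 0 (noise) to t = 1 (action)
    with fixed-step Euler updates. ASSUMPTION: the time convention and the
    number of steps are illustrative choices, not values from the paper."""
    rng = np.random.default_rng() if rng is None else rng
    x = rng.standard_normal(shape)      # start from Gaussian noise
    dt = 1.0 / num_steps
    for i in range(num_steps):
        t = i * dt
        x = x + dt * velocity_fn(x, t)  # one Euler step along the flow
    return x

# Usage with a hand-written velocity field that points straight at a known
# target chunk (4 timesteps x 7 joints, both hypothetical sizes).
target = np.full((4, 7), 0.5)
toy_velocity = lambda x, t: (target - x) / (1.0 - t)
actions = denoise_actions(toy_velocity, target.shape, rng=np.random.default_rng(0))
```

With the toy field above, the final Euler step lands on `target`, which makes the sketch easy to sanity-check; a trained model would supply the velocity instead, conditioned on the interleaved context.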