Galaxea Open-World Dataset and G0 Dual-System VLA Model

Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, Hang Zhao

TL;DR

This work addresses the scarcity of open-world robotic data by introducing the Galaxea Open-World Dataset, a large, richly annotated real-world corpus collected on a single embodiment. It pairs this dataset with G0, a dual-system robotics framework where a Vision-Language Model (System-2) plans actions and a Vision-Language-Action model (System-1) executes them, trained via a three-stage curriculum that includes cross-embodiment and single-embodiment pre-training plus task-focused post-training. Through extensive benchmarks across tabletop and long-horizon manipulation, the authors show that single-embodiment pre-training on Galaxea is crucial for strong performance, and that fine-tuned G0-VLM/G0-VLA achieve state-of-the-art results on diverse tasks. The work also emphasizes the importance of dataset quality and controlled ablations for understanding when cross-embodiment data helps, and it contributes open-source datasets and models to advance robust embodied AI.

Abstract

We present the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment, paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.
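The dual-system split described in the abstract can be pictured with a small control-loop sketch. This is an illustration only, not the authors' implementation: the class name `DualSystemAgent`, the callables `planner` and `controller`, and the `replan_every` interval are all hypothetical, and the assumption that the planner runs less often than the controller is inferred from the "slow thinking" and "fast execution" wording in the Figure 1 caption.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical types for illustration; none of these names come from the paper.
Observation = Dict[str, object]
Action = List[float]


@dataclass
class DualSystemAgent:
    # System-2 stand-in (VLM): maps (observation, task) to a subtask instruction.
    planner: Callable[[Observation, str], str]
    # System-1 stand-in (VLA): maps (observation, subtask) to a low-level action.
    controller: Callable[[Observation, str], Action]
    # Assumed value: the planner is queried once every `replan_every` control steps.
    replan_every: int = 10

    def run(self, get_obs: Callable[[], Observation], task: str, steps: int) -> List[Action]:
        """Run the loop for `steps` control steps and return the emitted actions."""
        actions: List[Action] = []
        subtask = ""
        for t in range(steps):
            obs = get_obs()
            # Slow loop: refresh the subtask instruction only occasionally.
            if t % self.replan_every == 0:
                subtask = self.planner(obs, task)
            # Fast loop: produce an action at every step for the current subtask.
            actions.append(self.controller(obs, subtask))
        return actions
```

In this sketch the planner and controller are plain callables, so stub functions can stand in for the real models when testing the loop structure on its own.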

Paper Structure

This paper contains 23 sections, 3 equations, 11 figures, 1 table.

Figures (11)

  • Figure 1: We introduce the Galaxea Open-World Dataset, a high-quality robot behavior dataset collected in the open world. Building on this dataset, we propose G0, a dual system composed of a VLM for slow thinking and a VLA model for fast execution.
  • Figure 2: Galaxea Open-World Dataset is collected by a fleet of robots with identical embodiments, operating across diverse real-world environments.
  • Figure 3: Data diversity statistics. (a) The distribution of total interaction time is shown across the four primary scene categories: Residential, Retail, Catering, and Office. (b) Trajectory counts are presented for a rich collection of object subcategories, which are organized into broader classes like Electronics, Household, and Furniture, showcasing the dataset's wide range of interactive items.
  • Figure 4: Task statistics. This figure illustrates the temporal and structural properties of the tasks within the dataset. (a) The distribution of task completion times reveals that most tasks are of moderate length, yet the dataset also contains a long tail of complex, long-horizon activities. (b) Task complexity, measured by the number of subtasks per task, is shown to vary widely, covering everything from simple actions to intricate multi-step procedures.
  • Figure 5: Embodied behavior statistics. (a) A breakdown of interaction time by body part usage illustrates the variety of motions, from simple 'Arms Only' manipulations to coordinated 'Whole Body' movements. (b) The long-tail distribution of skills highlights the dataset's rich action vocabulary, covering both frequent, fundamental actions (e.g., 'pick', 'place') and a wide array of more specialized skills.
  • ...and 6 more figures