Galaxea Open-World Dataset and G0 Dual-System VLA Model

Tao Jiang; Tianyuan Yuan; Yicheng Liu; Chenhao Lu; Jianning Cui; Xiao Liu; Shuiqi Cheng; Jiyang Gao; Huazhe Xu; Hang Zhao

TL;DR

This work addresses the scarcity of open-world robotic data by introducing the Galaxea Open-World Dataset, a large, richly annotated real-world corpus collected on a single embodiment. It pairs this dataset with G0, a dual-system robotics framework where a Vision-Language Model (System-2) plans actions and a Vision-Language-Action model (System-1) executes them, trained via a three-stage curriculum that includes cross-embodiment and single-embodiment pre-training plus task-focused post-training. Through extensive benchmarks across tabletop and long-horizon manipulation, the authors show that single-embodiment pre-training on Galaxea is crucial for strong performance, and that fine-tuned G0-VLM/G0-VLA achieve state-of-the-art results on diverse tasks. The work also emphasizes the importance of dataset quality and controlled ablations for understanding when cross-embodiment data helps, and it contributes open-source datasets and models to advance robust embodied AI.

Abstract

We present the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment and paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.
