Galaxea Open-World Dataset and G0 Dual-System VLA Model
Tao Jiang, Tianyuan Yuan, Yicheng Liu, Chenhao Lu, Jianning Cui, Xiao Liu, Shuiqi Cheng, Jiyang Gao, Huazhe Xu, Hang Zhao
TL;DR
This work addresses the scarcity of open-world robotic data by introducing the Galaxea Open-World Dataset, a large, richly annotated real-world corpus collected on a single embodiment. It pairs this dataset with G0, a dual-system robotics framework where a Vision-Language Model (System-2) plans actions and a Vision-Language-Action model (System-1) executes them, trained via a three-stage curriculum that includes cross-embodiment and single-embodiment pre-training plus task-focused post-training. Through extensive benchmarks across tabletop and long-horizon manipulation, the authors show that single-embodiment pre-training on Galaxea is crucial for strong performance, and that fine-tuned G0-VLM/G0-VLA achieve state-of-the-art results on diverse tasks. The work also emphasizes the importance of dataset quality and controlled ablations for understanding when cross-embodiment data helps, and it contributes open-source datasets and models to advance robust embodied AI.
Abstract
We present the Galaxea Open-World Dataset, a large-scale, diverse collection of robot behaviors recorded in authentic human living and working environments. All demonstrations are gathered using a consistent robotic embodiment and paired with precise subtask-level language annotations to facilitate both training and evaluation. Building on this dataset, we introduce G0, a dual-system framework that couples a Vision-Language Model (VLM) for multimodal planning with a Vision-Language-Action (VLA) model for fine-grained execution. G0 is trained using a three-stage curriculum: cross-embodiment pre-training, single-embodiment pre-training, and task-specific post-training. A comprehensive benchmark spanning tabletop manipulation, few-shot learning, and long-horizon mobile manipulation demonstrates the effectiveness of our approach. In particular, we find that the single-embodiment pre-training stage, together with the Galaxea Open-World Dataset, plays a critical role in achieving strong performance.
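
The dual-system split described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: `DualSystem`, `toy_planner`, and `toy_executor` are hypothetical names, the planner and executor are plain string functions standing in for the learned VLM and VLA, and the loop assumes the planner emits all subtasks up front, which the abstract does not specify. Only the three stage names in `CURRICULUM` are taken from the text.

```python
from dataclasses import dataclass
from typing import Callable, List

# Stage names and their order come from the abstract; nothing else here does.
CURRICULUM = [
    "cross-embodiment pre-training",
    "single-embodiment pre-training",
    "task-specific post-training",
]


@dataclass
class DualSystem:
    """Hypothetical wrapper pairing a planner (System-2) with an executor (System-1)."""

    # Stand-in for the VLM: (instruction, observation) -> list of subtasks.
    planner: Callable[[str, str], List[str]]
    # Stand-in for the VLA: (subtask, observation) -> action.
    executor: Callable[[str, str], str]

    def run(self, instruction: str, observation: str) -> List[str]:
        # System-2 decomposes the instruction; System-1 acts on each subtask.
        subtasks = self.planner(instruction, observation)
        return [self.executor(subtask, observation) for subtask in subtasks]


def toy_planner(instruction: str, observation: str) -> List[str]:
    # Toy decomposition: split the instruction on the word "then".
    return [part.strip() for part in instruction.split(" then ")]


def toy_executor(subtask: str, observation: str) -> str:
    # Toy action: echo the subtask as a symbolic action string.
    return f"act({subtask})"


agent = DualSystem(planner=toy_planner, executor=toy_executor)
actions = agent.run("pick up the cup then place it on the shelf", "rgb_frame_0")
```

In a real system the executor would produce continuous robot actions from camera observations, and the planner would likely be queried repeatedly as the scene changes; the sketch only shows how the two roles are divided.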
