ACE-Brain-0: Spatial Intelligence as a Shared Scaffold for Universal Embodiments

Ziyang Gong; Zehang Luo; Anke Tang; Zhe Liu; Shi Fu; Zhi Hou; Ganlin Yang; Weiyun Wang; Xiaofeng Wang; Jianbo Liu; Gen Luo; Haolan Kang; Shuang Luo; Yue Zhou; Yong Luo; Li Shen; Xiaosong Jia; Yao Mu; Xue Yang; Chunxiao Liu; Junchi Yan; Hengshuang Zhao; Dacheng Tao; Xiaogang Wang

TL;DR

This paper proposes the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging. It introduces ACE-Brain-0, a model built with this paradigm, and further adopts GRPO to strengthen the model's comprehensive capability.

Abstract

Universal embodied intelligence demands robust generalization across heterogeneous embodiments, such as autonomous driving, robotics, and unmanned aerial vehicles (UAVs). However, existing embodied brains that train a unified model over diverse embodiments frequently suffer from long-tailed data, gradient interference, and catastrophic forgetting, making it notoriously difficult to balance universal generalization with domain-specific proficiency. In this report, we introduce ACE-Brain-0, a generalist foundation brain that unifies spatial reasoning, autonomous driving, and embodied manipulation within a single multimodal large language model (MLLM). Our key insight is that spatial intelligence serves as a universal scaffold across diverse physical embodiments: although vehicles, robots, and UAVs differ drastically in morphology, they share a common need for modeling 3D mental space, making spatial cognition a natural, domain-agnostic foundation for cross-embodiment transfer. Building on this insight, we propose the Scaffold-Specialize-Reconcile (SSR) paradigm, which first establishes a shared spatial foundation, then cultivates domain-specialized experts, and finally harmonizes them through data-free model merging. Furthermore, we adopt Group Relative Policy Optimization (GRPO) to strengthen the model's comprehensive capability. Extensive experiments demonstrate that ACE-Brain-0 achieves competitive and even state-of-the-art performance across 24 spatial and embodiment-related benchmarks.
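To make the "Reconcile" step more concrete, the following is a minimal sketch of one common form of data-free merging: adding the averaged expert offsets back onto the shared scaffold (task-arithmetic style). The abstract does not specify which merging algorithm ACE-Brain-0 uses, so this technique, the `merge_experts` function, the `scale` parameter, and the toy parameter names are all assumptions made for illustration.

```python
from typing import Dict, List

# Hypothetical flat parameter store: parameter name -> list of values.
Params = Dict[str, List[float]]


def merge_experts(scaffold: Params, experts: List[Params], scale: float = 1.0) -> Params:
    """Task-arithmetic-style merge: scaffold + scale * mean(expert - scaffold).

    No training data is touched; only checkpoints are combined.
    This is an illustrative assumption, not the paper's confirmed method.
    """
    if not experts:
        raise ValueError("need at least one expert")
    merged: Params = {}
    for name, base in scaffold.items():
        merged_values = []
        for k, b in enumerate(base):
            # Average the experts' offsets from the shared scaffold.
            mean_delta = sum(e[name][k] - b for e in experts) / len(experts)
            merged_values.append(b + scale * mean_delta)
        merged[name] = merged_values
    return merged


# Toy checkpoints: both experts were fine-tuned from the same scaffold.
scaffold = {"layer.weight": [0.0, 1.0, 2.0]}
driving_expert = {"layer.weight": [0.5, 1.0, 2.0]}
uav_expert = {"layer.weight": [0.0, 1.0, 1.0]}

merged = merge_experts(scaffold, [driving_expert, uav_expert])
print(merged)  # {'layer.weight': [0.25, 1.0, 1.5]}
```

Because every expert starts from the same scaffold, each offset is measured against a common reference point, which is what allows the experts to be combined without revisiting any training data.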

Paper Structure (40 sections, 2 theorems, 47 equations, 38 figures, 8 tables)

Introduction
ACE-Brain-0 Architecture
Task Formulation
Multimodal Architecture
Multimodal Autoregressive Objective
Training Strategy
Stage 1: Spatial Scaffold Training
Stage 2: Supervised Specialized Expert Fine-Tuning
Stage 3: Across-Embodiment Reconcile Model Merging
Stage 4: Supervised Fine-Tuning on Embodied Data
Stage 5: Reinforcement Learning with GRPO
Experiments
Spatial Intelligence
Autonomous Driving Intelligence
Low-Altitude Intelligence
...and 25 more sections

Key Result

Theorem 1

Let $w\in\Delta_K$ be nonnegative weights, and define the joint risk $R_w(\theta):=\sum_{j=1}^K w_j R_j(\theta)$. Consider one shared update. Under Assumption ass:smooth_aligned, for any morphology $i\in\mathcal{M}$,
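The joint risk defined above can be illustrated numerically. The sketch below is a toy example under stated assumptions, not the paper's setting: the update rule and the bound itself are not reproduced in this excerpt, so it assumes a plain gradient step on $R_w$ with a hypothetical step size `lr`, and uses made-up quadratic per-task risks. It shows how one shared step can lower the joint risk while raising the risk of an individual task.

```python
from typing import List

Vector = List[float]


def risk(theta: Vector, center: Vector) -> float:
    """Toy per-task risk R_j(theta) = 0.5 * ||theta - center_j||^2 (an assumption)."""
    return 0.5 * sum((t - c) ** 2 for t, c in zip(theta, center))


def grad(theta: Vector, center: Vector) -> Vector:
    """Gradient of the toy quadratic risk."""
    return [t - c for t, c in zip(theta, center)]


def joint_risk(theta: Vector, centers: List[Vector], w: Vector) -> float:
    """R_w(theta) = sum_j w_j * R_j(theta)."""
    return sum(wj * risk(theta, c) for wj, c in zip(w, centers))


def shared_step(theta: Vector, centers: List[Vector], w: Vector, lr: float) -> Vector:
    """One gradient step on the joint risk (assumed form of the shared update)."""
    grads = [grad(theta, c) for c in centers]
    joint = [sum(wj * g[d] for wj, g in zip(w, grads)) for d in range(len(theta))]
    return [t - lr * gd for t, gd in zip(theta, joint)]


theta = [0.0, 0.0]
centers = [[1.0, 0.0], [-1.0, 0.0]]  # two tasks with opposing gradients
weights = [0.75, 0.25]               # nonnegative, sums to 1

theta_new = shared_step(theta, centers, weights, lr=0.5)

print(joint_risk(theta, centers, weights), joint_risk(theta_new, centers, weights))
# 0.5 0.40625   (joint risk decreases)
print(risk(theta, centers[1]), risk(theta_new, centers[1]))
# 0.5 0.78125   (the down-weighted task gets worse)
```

In this toy case the two task gradients point in opposite directions, so the step that lowers the weighted sum necessarily hurts the lower-weighted task. This is the kind of cross-task interference that a one-step bound is meant to quantify.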

Figures (38)

Figure 1: Cross-Embodiment Learning Paradigm of ACE-Brain-0 and Performance Comparison with other Embodied Brains. ACE-Brain-0 unifies tasks from four domains: Spatial Cognition, Autonomous Driving, Low-Altitude Sensing, and Embodied Manipulation. The question we aim to answer is: "How can we instill and unify these capabilities within a single embodied foundation brain?" Conventional joint training mixes multi-domain data with shared parameters, which often causes gradient interference across tasks; sequential training accumulates skills via stage-wise fine-tuning, but tends to overwrite previously learned capabilities, leading to catastrophic forgetting. In contrast, our Scaffold-Specialize-Reconcile paradigm first constructs a Spatial Expert as a universal foundational model, then trains the AD and UAV experts separately to acquire domain-specific skills while enabling coarse-grained spatial reasoning, and subsequently combines their expertise into a unified model via data-free expert merging. We further perform Embodied SFT, optionally followed by GRPO-based RFT for reward-guided post-training alignment. This pipeline delivers consistent and stable improvements across all four domains. The radar chart on the right compares ACE-Brain-0 against representative embodied brains across multiple benchmarks, showing stronger overall performance on a broader set of tasks and validating the unified cross-embodiment capability of ACE-Brain-0.
Figure 2: Overview of ACE-Brain-0 Capabilities. ACE-Brain-0 is a spatial-centric foundation brain that supports Spatial Intelligence, Embodied Manipulation, Low-Altitude Sensing, and Autonomous Driving. Specifically, ACE-Brain-0 is evaluated on 7 benchmarks for Spatial Cognition, 6 benchmarks for Autonomous Driving, 5 benchmarks for Low-Altitude Sensing, and 6 benchmarks for Embodied Interaction. The figure illustrates ACE-Brain-0's ability to integrate perception, decision, and planning across diverse real-world embodied scenarios, highlighting its generalization capability as a universal embodied intelligence model.
Figure 3: ACE-Brain-0’s unified multimodal architecture and cross-domain capability coverage. ACE-Brain-0 supports inputs including single-view images, multi-view images, and videos; the instruction examples illustrate that the model can perform Q&A-style tasks across domains (General/Spatial/Driving/Aerial/Embodied). The top row summarizes ACE-Brain-0’s core capability spectrum for cross-embodiment scenarios, such as Spatial Perception and Temporal Modeling, enabling unified representation and compositional generalization across domains.
Figure 4: Domain distribution and Token count. This nested pie chart illustrates the proportion of tokens contributed by each domain. The distribution exhibits a long-tailed characteristic, where UAV data constitutes a relatively small proportion of the corpus.
Figure 5: Example 1 of VSI Benchmark.
...and 33 more figures

Theorems & Definitions (8)

Remark 1
Remark 2: Neuroscientific evidence supporting reusable spatial scaffolds and representation space separation
Theorem 1: One-step interference bound
Proof: Proof of Theorem 1
Remark 3: Connection to isolation and gradient interference
Theorem 2: Scaffold-to-morphology transfer bound
Proof: Proof of Theorem 2
Remark 4: What the bound captures and its experimental implications
