ABot-N0: Technical Report on the VLA Foundation Model for Versatile Embodied Navigation

Zedong Chu, Shichao Xie, Xiaolong Wu, Yanfen Shen, Minghua Luo, Zhengbo Wang, Fei Liu, Xiaoxu Leng, Junjun Hu, Mingyang Yin, Jia Lu, Yingnan Guo, Kai Yang, Jiawei Han, Xu Chen, Yanqing Zhu, Yuxiang Zhao, Xin Liu, Yirong Yang, Ye He, Jiahang Wang, Yang Cai, Tianlin Zhang, Li Gao, Liu Liu, Mingchao Sun, Fan Jiang, Chiyu Wang, Zhicheng Liu, Hongyu Pan, Honglin Han, Zhining Gu, Kuan Yang, Jianfang Zhang, Di Jing, Zihao Guan, Wei Guo, Guoqing Liu, Di Yang, Xiangpo Yang, Menglin Yang, Hongguang Xing, Weiguo Li, Mu Xu

TL;DR

ABot-N0 proposes a unified Vision-Language-Action foundation model for versatile embodied navigation by coupling a Cognitive Brain (LLM) with a Flow-Matching Action Expert in a Brain-Action hierarchy. A three-layer Data Engine (7,802 high-fidelity scenes; 16.9M trajectories; 5.0M reasoning samples) fuels cross-task generalization across Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following, achieving state-of-the-art results on seven benchmarks. The Agentic Navigation System augments ABot-N0 with a Planner, hierarchical Topo-Memory, and a Closed-loop Self-Reflector for long-horizon, robust behavior in dynamic real-world settings, including deployment on a Unitree Go2 with onboard inference. Together, these components demonstrate strong cross-task generalization, scalable data synthesis, and practical real-world robustness, enabling socially aware, long-horizon autonomous navigation across indoor and outdoor environments.

Abstract

Embodied navigation has long been fragmented by task-specific architectures. We introduce ABot-N0, a unified Vision-Language-Action (VLA) foundation model that achieves a "Grand Unification" across 5 core tasks: Point-Goal, Object-Goal, Instruction-Following, POI-Goal, and Person-Following. ABot-N0 uses a hierarchical "Brain-Action" architecture, pairing an LLM-based Cognitive Brain for semantic reasoning with a Flow Matching-based Action Expert for precise, continuous trajectory generation. To support large-scale learning, we developed the ABot-N0 Data Engine, curating 16.9M expert trajectories and 5.0M reasoning samples across 7,802 high-fidelity 3D scenes (10.7 $\text{km}^2$). ABot-N0 achieves new SOTA performance across 7 benchmarks, significantly outperforming specialized models. Furthermore, our Agentic Navigation System integrates a planner with hierarchical topological memory, enabling robust, long-horizon missions in dynamic real-world environments.
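
The abstract says the Action Expert uses Flow Matching to generate continuous trajectories. As a rough illustration of that idea only, the sketch below integrates a velocity field from Gaussian noise (t = 0) to a waypoint sequence (t = 1) with fixed Euler steps. Everything specific here is an assumption rather than the authors' implementation: the function names, the 2D waypoint format, the horizon, the step count, and the toy closed-form velocity field that stands in for a learned network conditioned on the Cognitive Brain's output.

```python
import numpy as np


def sample_trajectory(velocity_fn, horizon, num_steps, rng):
    """Euler-integrate dx/dt = v(x, t) from noise at t=0 to a trajectory at t=1."""
    # Start from Gaussian noise over `horizon` (x, y) waypoints (assumed format).
    x = rng.standard_normal((horizon, 2))
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t runs over 0, dt, ..., 1 - dt, so 1 - t is never zero
        x = x + dt * velocity_fn(x, t)
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a learned, condition-dependent velocity network.

    For the straight-line path x_t = (1 - t) * x0 + t * x1, the velocity that
    carries any point x at time t onto x1 at t = 1 is (x1 - x) / (1 - t).
    """
    def velocity_fn(x, t):
        return (target - x) / (1.0 - t)
    return velocity_fn


# Toy target: five waypoints along a straight 2 m segment (illustrative values).
target = np.stack([np.linspace(0.0, 2.0, 5), np.zeros(5)], axis=1)
rng = np.random.default_rng(0)
waypoints = sample_trajectory(make_toy_velocity(target), horizon=5, num_steps=10, rng=rng)
print(np.round(waypoints, 3))
```

With this toy field the final Euler step lands on the target waypoints regardless of the initial noise. A trained model would replace `make_toy_velocity` with a network, and different noise draws would then yield different plausible trajectories.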

Paper Structure (85 sections, 7 equations, 21 figures, 6 tables)

Figures (21)

  • Figure 1: ABot-N0: A unified VLA foundation model for versatile embodied navigation. Powered by a massive dataset of 16.9M expert trajectories and 5M reasoning samples across diverse environments, the model achieves a "Grand Unification" across five core navigation tasks. It establishes new state-of-the-art performance across 7 challenging benchmarks and is successfully deployed in complex, dynamic real-world agentic navigation systems.
  • Figure 2: The Architecture of ABot-N0. The model adopts a hierarchical "Brain-Action" design. The Universal Multi-Modal Encoder unifies heterogeneous inputs (RGB observations, visual history, and goal specifications) into a shared token sequence. The Cognitive Brain (i.e., LLM) encodes these tokens and supports dual-mode operation: a Reasoning Head for high-level semantic understanding and an Action Head for motion planning. The Action Expert employs Flow Matching to generate trajectory distributions, enabling generalization across five navigation tasks.
  • Figure 3: High-Fidelity 3D Scene Ecosystem. Our data engine integrates diverse indoor and outdoor environments. Left: Indoor scenes span residential spaces to large-scale public venues (offices, malls, transit stations). Right: Outdoor scenes include real-world scans of intersections and parks, alongside the dynamic virtual city SocCity. All scenes are annotated with traversable navigation graphs for collision-free trajectory generation.
  • Figure 4: 3D Scene Ecosystem Statistics. Our collection comprises 7,802 high-fidelity 3D scenes, covering 6.25 $\text{km}^2$ of indoor environments (offices, malls, stations, homes) and 4.42 $\text{km}^2$ of outdoor environments (intersections, parks, city). Navigation graphs totaling 384,754 meters enable collision-free trajectory synthesis across diverse spatial scales, from compact residential units to expansive transit hubs and urban environments.
  • Figure 5: Point-Goal Navigation Data Pipeline. Our 4.0M point-goal trajectory dataset aggregates three complementary streams: (Top) 2.0M pseudo-trajectories from first-person internet videos via 3D structure recovery and metric alignment; (Middle) 1.7M synthetic trajectories from high-fidelity 3D scenes with annotated navigation graphs; (Bottom) 340K real-world robot demonstrations from heterogeneous datasets, providing ground-truth physical dynamics.
  • ...and 16 more figures
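
The totals quoted in the abstract can be cross-checked against the per-category numbers in the captions of Figures 4 and 5. The short check below uses only numbers stated above; the dictionary keys are illustrative labels, and the per-stream counts are the rounded values from the caption, so the sum is approximate.

```python
# Scene area from the Figure 4 caption, in hundredths of a square kilometre
# so that the addition is exact integer arithmetic.
indoor_km2_x100 = 625   # 6.25 km^2 indoor
outdoor_km2_x100 = 442  # 4.42 km^2 outdoor
total_km2 = (indoor_km2_x100 + outdoor_km2_x100) / 100

# Point-goal trajectory streams from the Figure 5 caption (rounded counts).
point_goal_streams = {
    "internet_video_pseudo": 2_000_000,
    "synthetic_3d_scenes": 1_700_000,
    "real_robot_demos": 340_000,
}
point_goal_total = sum(point_goal_streams.values())

print(f"total scene area: {total_km2:.2f} km^2")                   # 10.67, reported as 10.7
print(f"point-goal trajectories: {point_goal_total / 1e6:.2f}M")   # 4.04M, reported as 4.0M
```

Both sums agree with the headline figures after rounding: 10.67 $\text{km}^2$ against the abstract's 10.7 $\text{km}^2$, and 4.04M against the caption's 4.0M point-goal trajectories.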