SysNav: Multi-Level Systematic Cooperation Enables Real-World, Cross-Embodiment Object Navigation

Haokun Zhu, Zongtai Li, Zihan Liu, Kevin Guo, Zhengzhi Lin, Yuxin Cai, Guofei Chen, Chen Lv, Wenshan Wang, Jean Oh, Ji Zhang

TL;DR

This work formulates real-world ObjectNav as a system-level problem and introduces SysNav, a three-level ObjectNav system designed for real-world cross-embodiment deployment. To the best of the authors' knowledge, it is the first system capable of reliably and efficiently completing building-scale, long-range object navigation in complex real-world environments.

Abstract

Object navigation (ObjectNav) in real-world environments is a complex problem that requires addressing several challenges at once, including complex spatial structure, long-horizon planning, and semantic understanding. Recent advances in Vision-Language Models (VLMs) offer promising capabilities for semantic understanding, yet integrating them effectively into real-world navigation systems remains non-trivial. In this work, we formulate real-world ObjectNav as a system-level problem and introduce SysNav, a three-level ObjectNav system designed for real-world cross-embodiment deployment. SysNav decouples semantic reasoning, navigation planning, and motion control to ensure robustness and generalizability. At the high level, we summarize the environment into a structured scene representation and leverage VLMs to provide semantically grounded navigation guidance. At the mid level, we introduce a hierarchical room-based navigation strategy that reserves VLM guidance for room-level decisions, which makes effective use of the VLM's reasoning ability while keeping the system efficient. At the low level, planned waypoints are executed by embodiment-specific motion control modules. We deploy our system on three embodiments (a custom-built wheeled robot, the Unitree Go2 quadruped, and the Unitree G1 humanoid) and conduct 190 real-world experiments. Our system achieves substantial improvements in both success rate and navigation efficiency. To the best of our knowledge, SysNav is the first system capable of reliably and efficiently completing building-scale, long-range object navigation in complex real-world environments. Furthermore, extensive experiments on four simulation benchmarks demonstrate state-of-the-art performance. The project page is available at: https://cmu-vln.github.io/.

Paper Structure (20 sections, 3 figures, 2 tables)

Figures (3)

  • Figure 1: SysNav achieves object navigation in diverse real-world environments and generalizes across multiple embodiments. To our knowledge, it is the first system to reliably and efficiently complete object navigation at building scale.
  • Figure 2: Overview of the proposed SysNav. The high-level Semantic Reasoning module organizes environmental information into a structured scene representation (\ref{sec:method_scenerepresentation}) and leverages VLM reasoning to provide semantically grounded navigation guidance (\ref{sec:method_vlmreasoning}). The mid-level Room-based Navigation module performs hierarchical navigation with in-room exploration (\ref{sec:method_inroom}) and cross-room navigation (\ref{sec:method_crossroom}). The low-level Base Autonomy module executes planned waypoints through embodiment-specific motion control (\ref{sec:method_lowlevel}).
  • Figure 3: Qualitative results from real-world deployment of SysNav. The first three rows show building-scale object navigation with multiple constraints on a wheeled robot, including step-by-step VLM reasoning analysis. The last five rows demonstrate cross-embodiment performance on quadruped and humanoid robots. For each episode, we visualize the final scene representation, first-person view and global overview at task completion (zoom in for details of $\mathcal{R}$). Red denotes self-attribute constraints, blue denotes spatial relationship constraints and orange denotes target object categories.
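
The three-level decoupling described in the abstract and the Figure 2 caption can be illustrated with a toy sketch. This is not the authors' implementation: every name here (`Room`, `pick_room`, `plan_in_room`, `navigate`, the controller classes) is hypothetical, the `score_room` callback merely stands in for VLM reasoning, and the "scene representation" is reduced to a list of rooms with pre-listed objects and waypoints. The only point it tries to convey is the division of labor: the expensive semantic decision is made once per room, in-room planning runs without it, and the motion backend can be swapped per embodiment.

```python
from dataclasses import dataclass, field
from typing import Callable, List, Optional, Protocol, Tuple

Waypoint = Tuple[float, float]


@dataclass
class Room:
    """Hypothetical, heavily simplified entry of a scene representation."""
    name: str
    objects: List[str] = field(default_factory=list)      # revealed once explored
    frontiers: List[Waypoint] = field(default_factory=list)
    explored: bool = False


class MotionController(Protocol):
    """Low level: one implementation per embodiment (assumed interface)."""
    def go_to(self, waypoint: Waypoint) -> str: ...


class WheeledController:
    def go_to(self, waypoint: Waypoint) -> str:
        return f"wheeled: drive to {waypoint}"


class LeggedController:
    def go_to(self, waypoint: Waypoint) -> str:
        return f"legged: walk to {waypoint}"


def pick_room(rooms: List[Room], target: str,
              score_room: Callable[[Room, str], float]) -> Optional[Room]:
    """High level: room-level decision; score_room is a stand-in for a VLM."""
    candidates = [r for r in rooms if not r.explored]
    if not candidates:
        return None
    return max(candidates, key=lambda r: score_room(r, target))


def plan_in_room(room: Room) -> List[Waypoint]:
    """Mid level: in-room exploration without any semantic query."""
    return list(room.frontiers)


def navigate(rooms: List[Room], target: str,
             score_room: Callable[[Room, str], float],
             controller: MotionController) -> Tuple[List[str], bool]:
    """Run the loop until the target is found or every room is explored."""
    log: List[str] = []
    while True:
        room = pick_room(rooms, target, score_room)
        if room is None:
            return log, False
        for waypoint in plan_in_room(room):
            log.append(controller.go_to(waypoint))
        room.explored = True
        if target in room.objects:
            return log, True


def keyword_score(room: Room, target: str) -> float:
    """Toy scorer (assumption): a fixed lookup instead of VLM reasoning."""
    hints = {"mug": "kitchen", "pillow": "bedroom"}
    return 1.0 if hints.get(target) == room.name else 0.0
```

Passing `LeggedController()` instead of `WheeledController()` to `navigate` changes only the low-level commands; the room choice and in-room plan stay the same, which is the sense in which the sketch is cross-embodiment.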