FSUNav: A Cerebrum-Cerebellum Architecture for Fast, Safe, and Universal Zero-Shot Goal-Oriented Navigation

Mingao Tan, Yiyang Li, Shanze Wang, Xinming Zhang, Wei Zhang

Abstract

Current vision-language navigation methods face substantial bottlenecks in heterogeneous robot compatibility, real-time performance, and navigation safety, and they struggle to support open-vocabulary semantic generalization and multimodal task inputs. To address these challenges, this paper proposes FSUNav, a Cerebrum-Cerebellum architecture for fast, safe, and universal zero-shot goal-oriented navigation that integrates vision-language models (VLMs) into a dual-module design. The cerebellum, a high-frequency end-to-end module, provides a universal local planner based on deep reinforcement learning, enabling unified navigation across heterogeneous platforms (e.g., humanoid, quadruped, and wheeled robots) that improves navigation efficiency while significantly reducing collision risk. The cerebrum constructs a three-layer reasoning model and leverages VLMs to build an end-to-end detection and verification mechanism, enabling zero-shot open-vocabulary goal navigation without predefined IDs and improving task success rates in both simulation and real-world environments. Additionally, the framework supports multimodal inputs (e.g., text, target descriptions, and images), further enhancing generalization, real-time performance, safety, and robustness. Experimental results on the MP3D, HM3D, and OVON benchmarks demonstrate that FSUNav achieves state-of-the-art performance on object, instance-image, and task navigation, significantly outperforming existing methods. Real-world deployments on diverse robotic platforms further validate its robustness and practical applicability.
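To make the division of labor concrete, the following is a minimal Python sketch of the dual-frequency loop the abstract describes: a slow VLM-based cerebrum that re-plans semantic waypoints and a fast DRL cerebellum that tracks them safely. Every name here (VLMCerebrum, RLCerebellum, the robot interface, and both loop rates) is a hypothetical illustration under our own assumptions, not the paper's actual API.

```python
# Minimal sketch of the Cerebrum-Cerebellum split described above.
# All names and rates are hypothetical illustrations, not the authors'
# actual interface.
import time

CONTROL_HZ = 50.0        # assumed high-frequency cerebellum rate
REPLAN_EVERY = 50        # cerebrum re-plans once per 50 ticks (~1 Hz)

class VLMCerebrum:
    """Low-frequency semantic reasoner: maps (image, goal) -> waypoint."""
    def plan(self, rgb, goal):
        # In the paper this is a three-layer VLM pipeline; here we
        # return a fixed placeholder waypoint in the robot frame.
        return (1.0, 0.0)

class RLCerebellum:
    """High-frequency DRL local planner shared across embodiments."""
    def act(self, obs, waypoint):
        vx = min(0.6, waypoint[0])   # cap linear speed at 0.6 m/s
        wz = 0.5 * waypoint[1]       # proportional heading correction
        return vx, wz

def navigate(robot, goal, ticks=1500):
    """Run the dual-frequency loop: slow semantic plan, fast control.

    `robot` is a hypothetical interface exposing rgb(), observation(),
    and send_velocity().
    """
    cerebrum, cerebellum = VLMCerebrum(), RLCerebellum()
    waypoint = (0.0, 0.0)
    for i in range(ticks):
        if i % REPLAN_EVERY == 0:            # slow loop: semantic re-plan
            waypoint = cerebrum.plan(robot.rgb(), goal)
        vx, wz = cerebellum.act(robot.observation(), waypoint)
        robot.send_velocity(vx, wz)          # fast loop: local control
        time.sleep(1.0 / CONTROL_HZ)
```

Keeping the collision-avoiding policy in the fast inner loop is what allows the slow VLM reasoning to run at its own pace without compromising safety, which is the core claim of the cerebrum-cerebellum split.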

Figures (4)

  • Figure 1: The overall framework of FSUNav is shown in the figure. Current vision-language navigation methods still face significant limitations in terms of heterogeneous robot compatibility, real-time performance, and navigation safety, while also struggling to support open-vocabulary semantic generalization and multimodal task inputs. To address these issues, this paper proposes FSUNav, which leverages an efficient dual-brain collaborative architecture to achieve fast, safe, and generalizable zero-shot goal-oriented navigation, delivering comprehensive improvements in generalization capability, real-time performance, safety, and robustness.
  • Figure 2: The overall framework of FSUNav_Cerebrum is shown in the figure. A unified Vision-Language Model (VLM) serves as the core semantic engine across the three Cerebrum layers. The Semantic Layer parses multimodal goals into structured target profiles and performs open-vocabulary grounding; the Spatial Layer integrates VLM-driven semantic waypoints with geometry-based frontier exploration for efficient navigation; and the Rule Layer orchestrates behavior via two-stage verification and adaptive cooldown while also constructing a semantic scene graph. This hierarchical design enables training-free, zero-shot adaptation to heterogeneous goal-oriented navigation tasks (a minimal sketch of this three-layer dispatch follows the figure list).
  • Figure 3: In our real-world experimental setup, we deployed the framework on a Unitree Go2 EDU quadruped robot. A custom 3D-printed mount was used to attach an Intel RealSense D455 camera to the robot for capturing real-time RGB observations.
  • Figure 4: Under a maximum locomotion speed of 0.6 m/s, the quadruped robot successfully completed the open-vocabulary object goal navigation task targeting “umbrella.” During the experiment, the robot not only moved efficiently but also demonstrated real-time dynamic obstacle avoidance, safely approaching and accurately identifying the target object in an unstructured environment. Additional task types for the quadruped robot, along with physical experiments on the humanoid robot G1 and wheeled mobile robots, will be added in subsequent versions.
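As referenced in the Figure 2 caption above, here is a minimal Python sketch of the three-layer Cerebrum dispatch it describes. All names and signatures (StubVLM, Cooldown, TargetProfile, cerebrum_step) are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of the three Cerebrum layers from the Figure 2 caption:
# Semantic (goal grounding), Spatial (waypoint selection), Rule
# (verification + cooldown). All names are illustrative assumptions.
import time
from dataclasses import dataclass

@dataclass
class TargetProfile:
    """Structured goal emitted by the Semantic Layer (assumed fields)."""
    category: str

class Cooldown:
    """Adaptive cooldown: back off after a failed verification."""
    def __init__(self, base_s=5.0):
        self.base_s, self.until = base_s, 0.0
    def active(self):
        return time.time() < self.until
    def trigger(self):
        self.until = time.time() + self.base_s
        self.base_s *= 2.0   # adapt: lengthen after repeated failures

class StubVLM:
    """Placeholder for the single VLM shared by all three layers."""
    def ground(self, goal):                      # Semantic Layer
        return TargetProfile(category=str(goal))
    def score(self, rgb, frontier, category):    # Spatial Layer
        return 0.0
    def detect(self, rgb, category):             # Rule Layer, stage 1
        return False
    def verify(self, rgb, profile):              # Rule Layer, stage 2
        return False

def cerebrum_step(vlm, rgb, goal, frontiers, cooldown):
    """One reasoning tick: ground the goal, pick a frontier, verify."""
    profile = vlm.ground(goal)                   # Semantic Layer
    # Spatial Layer: fuse VLM semantic scores with geometric frontier
    # candidates (frontiers is assumed non-empty here).
    scores = [vlm.score(rgb, f, profile.category) for f in frontiers]
    waypoint = frontiers[scores.index(max(scores))]
    # Rule Layer: two-stage check, gated by the adaptive cooldown.
    found = False
    if not cooldown.active() and vlm.detect(rgb, profile.category):
        found = vlm.verify(rgb, profile)
        if not found:
            cooldown.trigger()   # stop re-checking a false positive
    return waypoint, found
```

The two-stage verification mirrors the caption's description: a cheap detection pass gates the more expensive VLM confirmation, and the cooldown keeps the Rule Layer from repeatedly querying the same false positive; the doubling back-off shown here is just one plausible policy.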