OmniNav: A Unified Framework for Prospective Exploration and Visual-Language Navigation

Xinda Xue, Junjun Hu, Minghua Luo, Xie Shichao, Jintao Chen, Zixun Xie, Quan Kuichen, Guo Wei, Mu Xu, Zedong Chu

TL;DR

OmniNav addresses the challenge of unified embodied navigation by coupling a fast, flow-matching policy that predicts continuous waypoints with a slow, frontier-based planner that leverages long-horizon memory and explicit chain-of-thought reasoning. It operates across instruct-goal, object-goal, point-goal, and frontier exploration within a single architecture, supported by a two-stage training regime that blends discrete language–vision data with continuous control. The approach achieves state-of-the-art performance on multiple benchmarks and demonstrates real-world deployment at up to 5 Hz, highlighting robust generalization and practical utility for versatile robotic navigation. By integrating large-scale generic vision–language data and a central memory with multimodal tokens, OmniNav provides a scalable path toward highly generalizable embodied intelligence in dynamic environments.

Abstract

Embodied navigation presents a core challenge for intelligent robots, requiring the comprehension of visual environments, natural language instructions, and autonomous exploration. Existing models often fall short in offering a unified solution across diverse navigation paradigms, resulting in low success rates and limited generalization. We introduce OmniNav, a unified framework addressing instruct-goal, object-goal, point-goal navigation, and frontier-based exploration within a single architecture. Our approach features a lightweight, low-latency policy that accurately predicts continuous-space waypoints (coordinates and orientations). This policy surpasses action-chunk methods in precision and supports real-world deployment at control frequencies up to 5 Hz. Architecturally, OmniNav employs a fast-slow system design: a fast module generates waypoints using short-horizon visual context and subtasks, while a slow module performs deliberative planning with long-horizon observations and candidate frontiers to select subsequent subgoals and subtasks. This collaboration enhances path efficiency and maintains trajectory coherence, particularly in exploration and memory-intensive scenarios. Crucially, we identify that the primary bottleneck isn't merely navigation policy learning, but a robust understanding of general instructions and objects. To boost generalization, OmniNav integrates large-scale, general-purpose training datasets, including those for image captioning and visual recognition, into a joint multi-task regimen. This significantly improves success rates and robustness. Extensive experiments confirm OmniNav's state-of-the-art performance across various navigation benchmarks, with real-world deployment further validating its efficacy. OmniNav provides practical insights for embodied navigation, charting a scalable path towards versatile, highly generalizable robotic intelligence.
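The TL;DR describes the fast policy as a flow-matching model that outputs continuous waypoints (coordinates and orientations). The sketch below illustrates only the generic sampling step of such a policy: start from Gaussian noise and integrate a velocity field from t = 0 to t = 1 with Euler steps. It is not the authors' implementation. The function names, the number of waypoints, the number of integration steps, and the toy velocity field (which stands in for a learned, observation-conditioned network) are all assumptions made for illustration.

```python
import numpy as np


def sample_waypoints(velocity_fn, num_waypoints=5, num_steps=10, seed=0):
    """Integrate dx/dt = velocity_fn(x, t) from t=0 to t=1 with Euler steps.

    Returns an array of shape (num_waypoints, 3); the columns are assumed
    to be (x, y, yaw) for illustration.
    """
    rng = np.random.default_rng(seed)
    x = rng.standard_normal((num_waypoints, 3))  # initial noise sample
    dt = 1.0 / num_steps
    for k in range(num_steps):
        t = k * dt  # t stays strictly below 1 inside the loop
        x = x + dt * velocity_fn(x, t)
    return x


def make_toy_velocity(target):
    """Hypothetical stand-in for a learned velocity network.

    For the straight-line path x_t = (1 - t) * x_0 + t * x_1, the field
    (x_1 - x_t) / (1 - t) carries any starting point to x_1 at t = 1.
    A real policy would predict the velocity from images and instructions.
    """
    def velocity_fn(x, t):
        return (target - x) / (1.0 - t)
    return velocity_fn


# Assumed target: five waypoints spaced 0.2 m apart along the x axis.
target = np.array([[0.2 * i, 0.0, 0.0] for i in range(1, 6)])
waypoints = sample_waypoints(make_toy_velocity(target))
```

With the toy field, the integrated sample lands on `target` up to floating-point error, which makes the transport from noise to waypoints easy to check. In a learned policy the velocity would come from a network, and the step count would trade accuracy against latency.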

Paper Structure

This paper contains 9 sections, 4 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: The fast thinking system can independently handle multi-task navigation, using a VLM backbone and a flow-matching policy to rapidly generate waypoints. Building on this, a slow thinking module is integrated to enable long-term memory and planning: it constructs long-range spatial and semantic memory using frontiers and images, and provides subgoal cues. The reasoning process can be summarized as follows: if the target already exists in memory, the fast thinking module is invoked to reach it; otherwise, the system selects the most appropriate subgoal for the next exploration step.
  • Figure 2: Reasoning process by the slow system for exploration. For the “find the bathtub” task, the model reasons over the frontier set using memory and semantic priors, iteratively generating subgoals for the next exploration.
  • Figure 3: Data composition overview. Four data types are used for training: navigation task data, embodied Q&A data, general MLLM data, and grounding and referring data.
  • Figure 4: Real-world deployment. Third-person views of three different navigation tasks, each deployed in a zero-shot setting. The gradient blue arrows indicate the trajectory, and the yellow box marks the target location. The model navigates effectively on a real quadruped robot.
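
The decision rule summarized in the Figure 1 caption (reach the target if it is already in memory, otherwise pick a subgoal to explore) can be sketched as a small function. This is a minimal illustration under stated assumptions, not the paper's method: the `Frontier` class, the `choose_subgoal` function, the dictionary-based memory, and the scalar `score` are hypothetical. According to the Figure 2 caption, the actual slow system reasons over the frontier set using memory and semantic priors; a precomputed score stands in for that reasoning here.

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple

Point = Tuple[float, float]


@dataclass
class Frontier:
    position: Point
    score: float  # hypothetical relevance score standing in for the slow system's reasoning


def choose_subgoal(
    target: str,
    memory: Dict[str, Point],
    frontiers: List[Frontier],
) -> Tuple[str, Point]:
    """Return (mode, goal position) following the rule in the Figure 1 caption."""
    if target in memory:
        # Target already observed: hand its location to the fast module.
        return ("reach_target", memory[target])
    if not frontiers:
        raise ValueError("no frontiers left to explore")
    # Otherwise explore the frontier judged most promising.
    best = max(frontiers, key=lambda f: f.score)
    return ("explore", best.position)


# Assumed example state: one remembered object and two candidate frontiers.
memory = {"sofa": (1.0, 2.0)}
frontiers = [Frontier((3.0, 1.0), 0.2), Frontier((4.0, 0.5), 0.7)]

print(choose_subgoal("sofa", memory, frontiers))     # ('reach_target', (1.0, 2.0))
print(choose_subgoal("bathtub", memory, frontiers))  # ('explore', (4.0, 0.5))
```

Keeping the rule as a pure function of the target, memory, and frontier set makes each slow-system decision easy to test in isolation from the waypoint policy.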