CompassNav: Steering From Path Imitation To Decision Understanding In Navigation

LinFeng Li, Jian Zhao, Yuan Xie, Xin Tan, Xuelong Li

TL;DR

CompassNav reframes embodied navigation from pure path imitation to decision understanding by pairing a dense, action-level dataset, Compass-Data-22k, with a Gap-Aware Hybrid Reward in a two-stage SFT-then-RFT training regime. The resulting agent develops an internal compass that evaluates all candidate moves, which improves generalization and sim-to-real transfer on a 7B LVLM base. Experiments show state-of-the-art performance on ObjectNav benchmarks and robust real-world deployment on a mobile robot, outperforming larger proprietary models with markedly lower data requirements. The work highlights the value of offline, dense supervision and adaptive reward design for efficient, decision-focused embodied agents, and it opens avenues for integrating external memory systems without sacrificing policy robustness.

Abstract

The dominant paradigm for training Large Vision-Language Models (LVLMs) in navigation relies on imitating expert trajectories. This approach reduces the complex navigation task to a sequence-to-sequence replication of a single correct path, fundamentally limiting the agent's ability to explore and generalize. In this work, we argue for and introduce a new paradigm: a shift from Path Imitation to Decision Understanding. The goal of this paradigm is to build agents that do not just follow, but truly understand how to navigate. We materialize this through two core contributions: first, we introduce Compass-Data-22k, a novel 22k-trajectory dataset. Its Reinforcement Fine-Tuning (RFT) subset provides a panoramic view of the decision landscape by annotating all feasible actions with A* geodesic distances. Second, we design a novel gap-aware hybrid reward function that dynamically adapts its feedback to decision certainty, shifting between decisive signals for optimal actions and nuanced scores to encourage exploration. Integrated into an SFT-then-RFT recipe, our CompassNav agent is trained not to memorize static routes, but to develop an internal "compass" that constantly intuits the direction to the goal by evaluating the relative quality of all possible moves. This approach enables our 7B agent to set a new state-of-the-art on Goal navigation benchmarks, outperforming even larger proprietary models, and achieve robust real-world goal navigation on a physical robot.
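
To make the reward idea in the abstract concrete, the following is a minimal sketch, not the paper's actual formula. It assumes each feasible action is annotated with its geodesic distance to the goal (smaller is better) and contrasts a binary reward, a min-max reward, and a hypothetical gap-aware switch between them. The function names, the hard switch, and the `gap_threshold` value are illustrative assumptions.

```python
from typing import Sequence


def binary_reward(distances: Sequence[float], chosen: int) -> float:
    """Decisive signal: 1.0 only if the chosen action has the smallest distance."""
    return 1.0 if distances[chosen] == min(distances) else 0.0


def min_max_reward(distances: Sequence[float], chosen: int) -> float:
    """Nuanced signal: linearly rescale distances so best -> 1.0 and worst -> 0.0."""
    lo, hi = min(distances), max(distances)
    if hi == lo:
        # All actions are equally good, so none should be penalized.
        return 1.0
    return (hi - distances[chosen]) / (hi - lo)


def gap_aware_reward(
    distances: Sequence[float], chosen: int, gap_threshold: float = 0.5
) -> float:
    """Hypothetical gap-aware hybrid (assumption, not the authors' formula).

    When the best action beats the runner-up by a wide margin (high certainty),
    fall back to the decisive binary signal. When the margin is small
    (low certainty), use the graded min-max score instead.
    """
    ordered = sorted(distances)
    gap = ordered[1] - ordered[0] if len(ordered) > 1 else float("inf")
    if gap >= gap_threshold:
        return binary_reward(distances, chosen)
    return min_max_reward(distances, chosen)


if __name__ == "__main__":
    clear_choice = [1.0, 4.0, 5.0]   # best action is far ahead of the rest
    close_call = [2.0, 2.1, 5.0]     # two actions are nearly tied
    print(gap_aware_reward(clear_choice, 1))  # second-best gets no credit
    print(gap_aware_reward(close_call, 1))    # second-best gets partial credit
```

Under these assumptions, a near-tie between two actions rewards the runner-up almost as much as the best action, while a clear-cut decision rewards only the best one.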

Paper Structure

This paper contains 36 sections, 9 equations, 11 figures, 6 tables, 1 algorithm.

Figures (11)

  • Figure 1: A visualization of our proposed paradigm shift from Path Imitation to Decision Understanding. (Bottom Left) The prevailing Path Imitation paradigm treats navigation as the replication of a single expert path (solid line), penalizing all other choices. (Top Left) In contrast, our Decision Understanding paradigm teaches the agent to evaluate the relative value of all alternative paths (dashed line), enabling flexible judgment at critical decision points. (Right) A concrete instantiation of these concepts in a navigation scenario.
  • Figure 2: Overview of our data generation and formulation pipeline. (1) Data Generation: We use an A* planner in habitat-sim to densely annotate all feasible actions with their distance to the goal. (2) Dataset Formulation: Trajectories are formatted into two types. SFT data (bottom right) contains a teacher's reasoning trace for imitation. RFT data (top right) contains the full vector of action distances for reward modeling.
  • Figure 3: The CompassNav two-stage training pipeline. In Stage 1 (SFT), a pretrained VLM is fine-tuned to imitate a teacher's "reason-then-act" output. In Stage 2 (RFT), the SFT-tuned policy generates multiple responses, which are scored by our Gap-Aware Hybrid Reward function.
  • Figure 4: A comparative analysis of our Gap-Aware Hybrid Reward against Binary and Min-Max schemes across three representative navigation scenarios. Each scenario shows the reward assigned for choosing the best, second-best, and worst actions.
  • Figure 5: (Left) Heatmaps showing the reward gap between the best and second-best actions under High and Low Certainty scenarios. (Right) Training reward curves for different reward functions.
  • ...and 6 more figures
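
The dense annotation step described in the Figure 2 caption can be illustrated on a toy problem. The sketch below is an assumption-laden stand-in: it uses a small 4-connected occupancy grid rather than habitat-sim, and the names `MOVES`, `astar_distance`, and `annotate_actions` are hypothetical. It labels every feasible action from the current cell with the A* shortest-path distance from the resulting cell to the goal.

```python
import heapq
from typing import Dict, List, Optional, Tuple

Cell = Tuple[int, int]

# Hypothetical discrete action set for a toy grid world (row, column offsets).
MOVES: Dict[str, Cell] = {
    "up": (-1, 0),
    "down": (1, 0),
    "left": (0, -1),
    "right": (0, 1),
}


def is_free(grid: List[List[int]], cell: Cell) -> bool:
    """A cell is traversable if it lies inside the grid and is marked 0."""
    r, c = cell
    return 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] == 0


def astar_distance(grid: List[List[int]], start: Cell, goal: Cell) -> Optional[int]:
    """Shortest 4-connected path length from start to goal, or None if unreachable."""

    def heuristic(cell: Cell) -> int:
        # Manhattan distance is admissible and consistent for unit-cost grid moves.
        return abs(cell[0] - goal[0]) + abs(cell[1] - goal[1])

    best_cost = {start: 0}
    frontier = [(heuristic(start), 0, start)]
    while frontier:
        _, cost, cell = heapq.heappop(frontier)
        if cell == goal:
            return cost
        if cost > best_cost.get(cell, float("inf")):
            continue  # stale heap entry
        for dr, dc in MOVES.values():
            nxt = (cell[0] + dr, cell[1] + dc)
            if not is_free(grid, nxt):
                continue
            new_cost = cost + 1
            if new_cost < best_cost.get(nxt, float("inf")):
                best_cost[nxt] = new_cost
                heapq.heappush(frontier, (new_cost + heuristic(nxt), new_cost, nxt))
    return None


def annotate_actions(grid: List[List[int]], position: Cell, goal: Cell) -> Dict[str, int]:
    """Map each feasible action to the remaining distance to the goal after taking it."""
    annotations: Dict[str, int] = {}
    for name, (dr, dc) in MOVES.items():
        nxt = (position[0] + dr, position[1] + dc)
        if is_free(grid, nxt):
            distance = astar_distance(grid, nxt, goal)
            if distance is not None:
                annotations[name] = distance
    return annotations


if __name__ == "__main__":
    grid = [
        [0, 0, 0],
        [0, 1, 0],  # 1 marks an obstacle
        [0, 0, 0],
    ]
    print(annotate_actions(grid, position=(0, 0), goal=(0, 2)))
```

The resulting per-action distance vector is the kind of signal a reward function can rank, in contrast to a single expert action label.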