FeudalNav: A Simple Framework for Visual Navigation

Faith Johnson; Bryan Bo Cao; Shubham Jain; Ashwin Ashok; Kristin Dana

TL;DR

FeudalNav tackles visual navigation in GPS-denied, unseen environments without relying on odometry, reinforcement learning (RL), or graph-based maps. It introduces a three-tier hierarchy: a high-level memory proxy map (MPM) learned via self-supervised SMoG contrastive learning, a mid-level WayNet that generates subgoal waypoints, and a low-level action module that maps subgoals to simple motor actions. The approach achieves performance competitive with state-of-the-art methods on image-goal navigation in Habitat/Gibson environments while using orders of magnitude less data and compute, and its results improve further with human-in-the-loop interventions. The work suggests that a compact latent-memory representation, combined with interpretable subgoal planning, can support robust navigation in novel environments and may enable scalable continual-learning adaptations.

Abstract

Visual navigation for robotics is inspired by the human ability to navigate environments using visual cues and memory, eliminating the need for detailed maps. In unseen, unmapped, or GPS-denied settings, traditional metric map-based methods fall short, prompting a shift toward learning-based approaches with minimal exploration. In this work, we develop a hierarchical framework that decomposes the navigation decision-making process into multiple levels. Our method learns to select subgoals through a simple, transferable waypoint selection network. A key component of the approach is a latent-space memory module organized solely by visual similarity, as a proxy for distance. This alternative to graph-based topological representations proves sufficient for navigation tasks, providing a compact, lightweight, simple-to-train navigator that can find its way to the goal in novel locations. We show competitive results with a suite of SOTA methods in Habitat AI environments without using any odometry in training or inference. An additional contribution leverages the interpretability of the framework for interactive navigation. We consider the question: how much direction intervention/interaction is needed to achieve success in all trials? We demonstrate that even minimal human involvement can significantly enhance overall navigation performance.

Paper Structure (16 sections, 1 equation, 6 figures, 4 tables)

Introduction
Related Work
Feudal Learning
Methods
High-Level Manager: Memory
Mid-Level Manager: Direction
Low-Level Worker: Action
Results
Image-Goal Navigation Task
Training and Testing Procedure
FeudalNav Performance
Feature Comparison
Role of Full Hierarchy
Navigation with Human Feedback
Conclusion
...and 1 more section

Figures (6)

Figure 1: FeudalNav provides a no-graph, no-odometry, and no-RL visual navigation agent for the image-goal task in previously unseen environments. This simple framework uses a hierarchy that consists of: (1) a high-level manager with a memory proxy map (MPM) that frames memory as a latent space learning problem, (2) a mid-level manager waypoint network (WayNet) mimicking human teleoperation to guide worker agent exploration, and (3) a low-level worker choosing actions in the environment based on the previous layers' subgoals. Optionally, a human-in-the-loop component can intervene to improve navigation (see the Navigation with Human Feedback section).
Figure 2: Method overview. (1) A subset of trajectories of point-click and observation-image pairs is selected from the LAVN dataset (Johnson et al., 2024) for learning a latent space for the memory proxy map and training WayNet. We test our method on a separate set of environments. (2) Images from these pairs are clustered based on feature similarity, and cluster members form positive pairs used for contrastively learning a latent space. (3) The learned latent space is used to build a memory proxy map where the high-level manager (HLM) records a history of agent locations. (4) The waypoint network (WayNet) is trained to provide subgoals (points) for navigation based on visual observations, imitating human teleoperation via point-clicks. (5) Based on this point-click guidance and depth map input, the low-level worker predicts whether to move forward, left, or right in order to move toward the subgoal (point) and avoid obstacles. (6) At test time, these low-level actions guide agent movement and produce new observations as input for the upper levels of the hierarchy.
Figure 3: Illustration of the memory proxy map (MPM) during navigation. Row 1: RGB observation images along a trajectory are shown alongside a diagram of the agent's location in the environment. Colored circles (blue/green) represent the traveled path. Row 2: The MPM with Gaussian-weighted occupancy markers corresponding to each observation image. The map is local, fixed in size, and cropped around the most recent latent map position. This allows the agent to mark locations in latent space (rather than metric space) and recognize repeated observations. Similar observation images cluster together (blue) until a significantly different view appears, forming a new group (green). The MPM provides a graph-free (and interpretable) mechanism for tracking previously visited regions, proving efficient in image-goal navigation by quantifying exploration in different areas of the environment.
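The Gaussian-weighted, locally cropped map described in the Figure 3 caption can be illustrated with a small sketch. This is not the authors' implementation: it assumes that each observation has already been embedded as a 2-D latent coordinate, and the function names (`gaussian_stamp`, `build_local_mpm`) and parameters (`size`, `sigma`, `cells_per_unit`) are hypothetical choices made only for illustration.

```python
import numpy as np

def gaussian_stamp(size, center, sigma):
    """Return a (size, size) map holding one Gaussian bump at `center` (row, col)."""
    rows, cols = np.mgrid[0:size, 0:size]
    d2 = (rows - center[0]) ** 2 + (cols - center[1]) ** 2
    return np.exp(-d2 / (2.0 * sigma ** 2))

def build_local_mpm(latent_xy, size=32, sigma=1.5, cells_per_unit=4.0):
    """Accumulate Gaussian markers for a history of 2-D latent positions.

    Assumption: `latent_xy` is a sequence of 2-D latent coordinates, oldest
    first. The fixed-size map is centered on the most recent position, so
    markers far from it in latent space add almost nothing to the window.
    """
    latent_xy = np.asarray(latent_xy, dtype=float)
    current = latent_xy[-1]
    mpm = np.zeros((size, size))
    for p in latent_xy:
        # Offset from the current latent position, converted to grid cells.
        center = (p - current) * cells_per_unit + size // 2
        mpm += gaussian_stamp(size, center, sigma)
    return mpm

# Hypothetical history: three near-identical views, then a clearly new view.
history = [(0.0, 0.0), (0.05, 0.0), (0.0, 0.05), (2.0, 2.0)]
mpm = build_local_mpm(history)
print(mpm.shape, float(mpm.max()))
```

In this sketch, repeated or similar observations stack their markers in nearly the same cells, so high values indicate heavily visited latent regions, while a new view opens a separate bump elsewhere in the window.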
Figure 4: (Best viewed zoomed in.) Qualitative results for the waypoints predicted by WayNet (blue), shown with the ground-truth human click points from the LAVN dataset (Johnson et al., 2024) (orange). Note that the majority of the samples show high overlap between the two. When they diverge, the WayNet waypoints still lead to navigable areas in each observation, showing that our network learns an acceptable navigation policy.
Figure 5: Heatmaps of distance matrices illustrate metric distances between image pairs in a 450-image trajectory. Left: The ground-truth distance matrix (brighter indicates greater distance) represents spatial separation in the simulated environment. Right: Feature-distance matrices (computed via MSE) compare image pairs using features from SMoG (Pang et al., 2022), MoCo (He et al., 2020), ResNet (He et al., 2016), and SwAV (Caron et al., 2020). We select SMoG for feature extraction as a good proxy for metric distance.
...and 1 more figure
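The comparison in Figure 5, where MSE feature-distance matrices are set against a ground-truth metric distance matrix, can be sketched as follows. The captions do not state how the match between the two matrices was judged, so the Pearson correlation used in `proxy_score` is an assumption introduced here, as are all function names and the synthetic data standing in for real image features and agent positions.

```python
import numpy as np

def pairwise_mse(features):
    """(N, D) features -> (N, N) matrix of mean squared feature differences."""
    f = np.asarray(features, dtype=float)
    diff = f[:, None, :] - f[None, :, :]
    return (diff ** 2).mean(axis=-1)

def pairwise_euclidean(positions):
    """(N, 2) positions -> (N, N) matrix of metric (Euclidean) distances."""
    p = np.asarray(positions, dtype=float)
    diff = p[:, None, :] - p[None, :, :]
    return np.sqrt((diff ** 2).sum(axis=-1))

def proxy_score(feature_dist, metric_dist):
    """Pearson correlation over the upper triangles of two distance matrices.

    Assumption: correlation is only one plausible way to quantify how well
    feature distance tracks metric distance.
    """
    iu = np.triu_indices_from(feature_dist, k=1)
    return float(np.corrcoef(feature_dist[iu], metric_dist[iu])[0, 1])

# Synthetic stand-ins: a straight-line trajectory and two feature sets.
rng = np.random.default_rng(0)
positions = np.stack([np.linspace(0.0, 10.0, 50), np.zeros(50)], axis=1)
position_aware = positions @ rng.normal(size=(2, 16))  # varies with location
position_blind = rng.normal(size=(50, 16))             # unrelated to location

metric = pairwise_euclidean(positions)
print("position-aware:", proxy_score(pairwise_mse(position_aware), metric))
print("position-blind:", proxy_score(pairwise_mse(position_blind), metric))
```

With real data, `features` would hold one embedding per trajectory image from each candidate encoder, and the encoder whose feature-distance matrix best tracks the metric one would be the better distance proxy. For a 450-image trajectory the broadcasted difference array has 450 × 450 × D entries, which is manageable for typical embedding sizes.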
