FeudalNav: A Simple Framework for Visual Navigation
Faith Johnson, Bryan Bo Cao, Shubham Jain, Ashwin Ashok, Kristin Dana
TL;DR
FeudalNav tackles visual navigation in GPS-denied, unseen environments without relying on odometry, reinforcement learning (RL), or graph-based maps. It introduces a three-tier hierarchy: a high-level memory proxy map (MPM) learned via self-supervised SMoG contrastive learning, a mid-level WayNet that generates subgoal waypoints, and a low-level action module that maps subgoals to simple motor actions. The approach achieves performance competitive with state-of-the-art methods on image-goal navigation in Habitat/Gibson environments while using orders of magnitude less data and compute, and its results improve further with human-in-the-loop interventions. The work suggests that a compact latent-memory representation, combined with interpretable subgoal planning, can support robust navigation in novel environments and may enable scalable continual-learning adaptations.
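To make the three-tier decomposition concrete, here is a minimal sketch of one decision step, assuming toy stand-ins for every learned component. The names `encode`, `MemoryProxy`, `select_waypoint`, `low_level_action`, and `step`, the brightest-column waypoint rule, and the one-third/two-thirds thresholds are all hypothetical illustrations, not the paper's networks or action set.

```python
import numpy as np

def encode(image):
    """Toy encoder (assumption): per-channel mean instead of a learned network."""
    return image.mean(axis=(0, 1))

class MemoryProxy:
    """High-level stand-in: keeps latent vectors of past observations."""
    def __init__(self):
        self.latents = []

    def update(self, z):
        self.latents.append(np.asarray(z, dtype=float))

    def summary(self):
        # Mean latent as a crude memory summary (assumption).
        return np.mean(self.latents, axis=0)

def select_waypoint(image, memory_summary):
    """Mid-level stand-in: return a (row, col) pixel subgoal in the current view.

    Toy rule (assumption): aim at the brightest image column. The memory
    summary is accepted to show the data flow but is unused by this rule.
    """
    height = image.shape[0]
    column_brightness = image.sum(axis=2).mean(axis=0)
    return height // 2, int(np.argmax(column_brightness))

def low_level_action(waypoint, width):
    """Low-level stand-in: map the waypoint's column to a discrete action."""
    _, col = waypoint
    if col < width / 3:
        return "turn_left"
    if col > 2 * width / 3:
        return "turn_right"
    return "forward"

def step(image, memory):
    """One pass through the hierarchy: memory update, subgoal, action."""
    memory.update(encode(image))
    waypoint = select_waypoint(image, memory.summary())
    return low_level_action(waypoint, image.shape[1])
```

Calling `step(image, MemoryProxy())` on an H x W x 3 array returns one of three action strings. The point of the sketch is only the data flow: each level consumes the output of the level above it, and the intermediate waypoint is a pixel location that a person can inspect or override.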
Abstract
Visual navigation for robotics is inspired by the human ability to navigate environments using visual cues and memory, eliminating the need for detailed maps. In unseen, unmapped, or GPS-denied settings, traditional metric map-based methods fall short, prompting a shift toward learning-based approaches with minimal exploration. In this work, we develop a hierarchical framework that decomposes the navigation decision-making process into multiple levels. Our method learns to select subgoals through a simple, transferable waypoint selection network. A key component of the approach is a latent-space memory module organized solely by visual similarity, which serves as a proxy for distance. This alternative to graph-based topological representations proves sufficient for navigation tasks, providing a compact, lightweight, simple-to-train navigator that can find its way to the goal in novel locations. We show results competitive with a suite of SOTA methods in Habitat AI environments without using any odometry in training or inference. An additional contribution leverages the interpretability of the framework for interactive navigation. We consider the question: how much directional intervention or interaction is needed to achieve success in all trials? We demonstrate that even minimal human involvement can significantly enhance overall navigation performance.
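The idea of a memory organized by visual similarity rather than by metric position can be sketched as follows. This is an illustrative stand-in, not the paper's memory module: it swaps in a fixed cosine-similarity threshold where the paper uses a learned latent space, and the class name `SimilarityMemory`, the `threshold` value, and the visit counts are all assumptions.

```python
import numpy as np

class SimilarityMemory:
    """Groups observation latents by cosine similarity (hypothetical sketch).

    A new latent is merged into the most similar stored entry when the
    similarity reaches `threshold`; otherwise it starts a new entry.
    No odometry or position estimate is used anywhere.
    """

    def __init__(self, threshold=0.9):
        self.threshold = threshold
        self.keys = []    # unit-norm latent vectors, one per entry
        self.counts = []  # number of observations merged into each entry

    def add(self, z):
        """Insert a nonzero latent vector and return the index of its entry."""
        z = np.asarray(z, dtype=float)
        z = z / np.linalg.norm(z)  # assumes z is not the zero vector
        if self.keys:
            sims = np.stack(self.keys) @ z  # cosine similarity to each entry
            best = int(np.argmax(sims))
            if sims[best] >= self.threshold:
                self.counts[best] += 1
                return best
        self.keys.append(z)
        self.counts.append(1)
        return len(self.keys) - 1
```

In this sketch, two views that look alike fall into the same entry and are treated as nearby, while a dissimilar view opens a new entry. Entries with low counts could then be read as less-visited regions, which is one plausible way such a memory might inform subgoal selection.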
