Feudal Networks for Visual Navigation

Faith Johnson, Bryan Bo Cao, Ashwin Ashok, Shubham Jain, Kristin Dana

TL;DR

FeudalNav tackles image-goal navigation in unseen environments without relying on reinforcement learning, odometry, graphs, or metric maps. It introduces a hierarchical feudal architecture with a self-supervised memory proxy map and a waypoint predictor (WayNet) that imitates human teleoperation to guide local exploration. The approach achieves near-state-of-the-art performance on image-goal tasks in Habitat Gibson environments using a publicly released 103K-human teleoperation dataset and a lean training setup. This modular, no-graph framework fosters robust transfer to new environments and highlights the value of learned memory and waypoint guidance for scalable navigation.

Abstract

Visual navigation follows the intuition that humans can navigate without detailed maps. A common approach is interactive exploration while building a topological graph with images at nodes that can be used for planning. Recent variations learn from passive videos and can navigate using complex social and semantic cues. However, a significant number of training videos are needed, large graphs are used, and scenes are not truly unseen because odometry is used. We introduce a new approach to visual navigation using feudal learning, which employs a hierarchical structure consisting of a worker agent, a mid-level manager, and a high-level manager. Key to the feudal learning paradigm, agents at each level see a different aspect of the task and operate at different spatial and temporal scales. Two unique modules are developed in this framework. For the high-level manager, we learn a memory proxy map in a self-supervised manner to record prior observations in a learned latent space and avoid the use of graphs and odometry. For the mid-level manager, we develop a waypoint network that outputs intermediate subgoals imitating human waypoint selection during local navigation. This waypoint network is pre-trained using a new, small set of teleoperation videos that we make publicly available, with training environments different from testing environments. The resulting feudal navigation network achieves near-SOTA performance, while providing a novel no-RL, no-graph, no-odometry, no-metric-map approach to the image-goal navigation task.

Paper Structure (22 sections, 1 equation, 7 figures, 3 tables)

Figures (7)

  • Figure 1: Feudal Navigation Network (FeudalNav), a no-RL, no-odometry, no-graph, no-metric-map visual navigation agent for the image-goal task in previously unseen environments. The three main components are: (1) a high-level manager that creates a memory proxy map (MPM) to use as an aggregate observation for making high-level navigation decisions, (2) a mid-level manager waypoint network (WayNet) that mimics human teleoperation by predicting visible points in the environment to guide worker agent exploration, and (3) a low-level worker that finds the subgoal using robust point matching.
  • Figure 2: Samples of RGB images in our human navigation dataset, captured from an egocentric camera in both Gibson [xiazamirhe2018gibsonenv] and Matterport [Matterport3D] environments. Video frames progress from left to right. The point-click guidance that a human provides for robot visual navigation is visualized with red dots.
  • Figure 3: Stacked histograms depicting $\#$Frames and $\#$LM per trajectory. (a) The majority of trajectories reach the maximum $\#$Frames of 500. The number of human point-click waypoints equals the number of frames for each scene. (b) Most trajectories in the Gibson rooms contain only a small number of annotated landmarks (e.g. $<25$), while the number increases in Matterport due to its larger and more complicated environments. (LM denotes landmark.)
  • Figure 4: (1) Point-click data is collected while human teleoperators direct agent exploration of different environments. The resulting set of point-image pairs makes up the human navigation dataset. (2) From this dataset, we find clusters of observations based on feature similarity. (3) These clusters provide positive pairs to train the navigation memory module that serves as the high-level manager (HLM) for our navigation agent. At test time, this HLM creates a map of historical agent locations (the memory proxy map) in the learned space. (4) These maps are created for the human navigation dataset and used to train the mid-level manager. During training, the memory proxy map and the current observations are used to predict human-like point-click supervision to guide environment exploration. (5) Based on this point-click guidance, the worker executes low-level actions directly in the simulated environment. (See Figure 5.) (6) At test time, these low-level actions guide agent movement and produce new observations as input for the upper levels of the hierarchy.
  • Figure 5: Map showing which mid-level manager point-click locations map to which simulator actions for the low-level worker. The bottom left grid (blue) and the bottom right grid (red) correspond to "turn left" and "turn right" respectively. The center middle grid (yellow) corresponds to "move forward". The top left grid (blue and yellow) and the top right grid (red and yellow) correspond to the joint action of turning (left for blue and right for red) and then moving forward. The two vertical delineations are at 0.4 and 0.6 times the x dimension of the observation respectively. The horizontal line is drawn at 0.8 times the y dimension of the observation.
  • ...and 2 more figures
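
The point-click-to-action mapping described in the Figure 5 caption can be illustrated with a short sketch. This is not the authors' implementation: the function name `click_to_actions` and the action strings are hypothetical, and the sketch assumes that y is measured in image coordinates from the top of the observation (so the "bottom" grids lie below the 0.8 line) and that the central column spans the full image height. Only the 0.4, 0.6, and 0.8 thresholds come from the caption.

```python
from typing import List

# Thresholds from the Figure 5 caption (fractions of the observation size).
LEFT_FRAC = 0.4     # left vertical delineation, fraction of width
RIGHT_FRAC = 0.6    # right vertical delineation, fraction of width
HORIZON_FRAC = 0.8  # horizontal line, fraction of height


def click_to_actions(x: float, y: float, width: int, height: int) -> List[str]:
    """Map a predicted point click (in pixels) to a list of worker actions.

    Assumption: (0, 0) is the top-left pixel, so larger y is lower in the image.
    Action names are illustrative placeholders, not simulator identifiers.
    """
    if not (0 <= x < width and 0 <= y < height):
        raise ValueError("click lies outside the observation")

    # Central column: move straight ahead (assumed to span the full height).
    if LEFT_FRAC * width <= x <= RIGHT_FRAC * width:
        return ["move_forward"]

    # Side columns: the turn direction depends on which side was clicked.
    turn = "turn_left" if x < LEFT_FRAC * width else "turn_right"

    # Below the horizontal line: turn only. Above it: turn, then move forward.
    if y >= HORIZON_FRAC * height:
        return [turn]
    return [turn, "move_forward"]
```

For a 100×100 observation under these assumptions, a click at (10, 90) falls in the bottom left grid and yields a single left turn, while a click at (90, 10) falls in the top right grid and yields a right turn followed by a forward step.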