Memory Proxy Maps for Visual Navigation

Faith Johnson, Bryan Bo Cao, Ashwin Ashok, Shubham Jain, Kristin Dana

TL;DR

The paper addresses visual navigation in unseen environments without odometry, graphs, or reinforcement learning by proposing a three-tier feudal architecture built around a self-supervised Memory Proxy Map (MPM). It introduces a high-level manager that maintains the MPM, a mid-level waypoint generator (WayNet) trained on human point-click demonstrations, and a low-level action module that maps depth and WayNet waypoints to discrete actions. The approach achieves state-of-the-art performance on image-goal navigation in Gibson Habitat environments with significantly reduced data and without simulators, odometry, or graph-based planning. The work highlights the viability of memory-based, hierarchical navigation for unseen environments and points toward efficient, continual-learning-ready deployment in real-world scenarios.

Abstract

Visual navigation takes inspiration from humans, who navigate in previously unseen environments using vision without detailed environment maps. Inspired by this, we introduce a novel no-RL, no-graph, no-odometry approach to visual navigation using feudal learning to build a three-tiered agent. Key to our approach is a memory proxy map (MPM), an intermediate representation of the environment learned in a self-supervised manner by the high-level manager agent that serves as a simplified memory, approximating what the agent has seen. We demonstrate that recording observations in this learned latent space is an effective and efficient memory proxy that can remove the need for graphs and odometry in visual navigation tasks. For the mid-level manager agent, we develop a waypoint network (WayNet) that outputs intermediate subgoals, or waypoints, imitating human waypoint selection during local navigation. For the low-level worker agent, we learn a classifier over a discrete action space that avoids local obstacles and moves the agent towards the WayNet waypoint. The resulting feudal navigation network offers a novel approach with no RL, no graph, no odometry, and no metric map, all while achieving SOTA results on the image-goal navigation task.
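To make the idea of recording observations in a learned latent space more concrete, the following is a minimal sketch of how such a memory could be maintained. It is not the authors' implementation. It assumes a 2D latent coordinate in [-1, 1] per observation, a fixed grid resolution, and an isotropic Gaussian marker; the helper names `latent_to_cell`, `add_marker`, and `crop_around` are hypothetical, and the latent values are made up for illustration.

```python
import math

def latent_to_cell(z, size, lo=-1.0, hi=1.0):
    """Quantize a 2D latent coordinate (assumed to lie in [lo, hi]) to a grid cell."""
    cell = []
    for v in z:
        t = (min(max(v, lo), hi) - lo) / (hi - lo)
        cell.append(min(int(t * size), size - 1))
    return tuple(cell)

def add_marker(mpm, center, sigma=1.5):
    """Add a Gaussian-weighted occupancy marker to the grid, in place."""
    cy, cx = center
    for y, row in enumerate(mpm):
        for x in range(len(row)):
            d2 = (y - cy) ** 2 + (x - cx) ** 2
            row[x] += math.exp(-d2 / (2.0 * sigma ** 2))
    return mpm

def crop_around(mpm, center, half):
    """Return a fixed-size window centered on `center`, zero-padded at the borders."""
    cy, cx = center
    h, w = len(mpm), len(mpm[0])
    window = []
    for y in range(cy - half, cy + half + 1):
        row = []
        for x in range(cx - half, cx + half + 1):
            inside = 0 <= y < h and 0 <= x < w
            row.append(mpm[y][x] if inside else 0.0)
        window.append(row)
    return window

# Illustrative latents: two near-identical views, then a clearly different one.
size = 21
mpm = [[0.0] * size for _ in range(size)]
latents = [(0.0, 0.0), (0.02, 0.01), (0.6, -0.4)]
for z in latents:
    cell = latent_to_cell(z, size)
    add_marker(mpm, cell)

# Local map cropped around the most recently added latent position.
local = crop_around(mpm, cell, half=5)
```

In this sketch, repeated similar observations fall into the same cell and accumulate a higher value, so the magnitude at a location acts as a rough indicator of how much the agent has already explored there, without any metric pose.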

Paper Structure

This paper contains 18 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: Method overview. 1: A subset of trajectories of point-click and observation-image pairs is selected from the LAVN dataset [johnson2024landmark] for learning a latent space for the memory proxy map and training WayNet. We test our method on a separate set of environments. 2: Images from these pairs are clustered based on feature similarity, and cluster members form positive pairs used for contrastively learning a latent space. 3: The learned latent space is used to build a memory proxy map where the high-level manager (HLM) records a history of agent locations. 4: The waypoint network (WayNet) is trained to provide subgoals (points) for navigation based on visual observations, imitating human teleoperation via point-clicks. 5: Based on this point-click guidance and depth map input, the low-level worker predicts whether to move forward, left, or right in order to move towards the subgoal (point) and avoid obstacles. 6: At test time, these low-level actions guide agent movement and produce new observations as input for the upper levels of the hierarchy.
  • Figure 2: Illustration of the memory proxy map (MPM) during navigation. Row 1: RGB observation images along a trajectory are shown with a diagram of the agent's corresponding location in an environment. The colored circles (blue/green) represent the traveled path. Row 2: The MPM with Gaussian-weighted occupancy markers corresponding to each observation image. The map is local, of fixed size, and cropped around the most recently added latent map position. In this manner, the agent marks locations in the latent space (not a metric space) corresponding to recently viewed images, thus remembering when observations repeat. Similar observation images cluster together (in blue) until the next view is significantly different in appearance and a new group begins (in green). The MPM is a convenient no-graph mechanism for remembering previously visited regions, and it is effective and efficient in the image-goal navigation task for quantifying the amount of exploration in a given area of the environment.
  • Figure 3: (Best viewed zoomed) We show qualitative results for the waypoints predicted by WayNet (blue) alongside the ground truth human click points from the LAVN dataset [johnson2024landmark] (orange). Note that the majority of the samples show high overlap between the two. When they diverge, the WayNet waypoints still lead to navigable areas in each observation, showing that our network learns an acceptable navigation policy.
  • Figure 4: Distance matrices showing a heatmap of distances between each pair of images in a single trajectory (450-image sequence). (Left) Ground truth distance matrix (brighter is farther away) of the locations of each image in the simulated environment. Compare this to each of the image feature-distance matrices (computed using MSE) for the same image pairs using features from the following (from left to right after the GT distance): SMoG [pang2022smog], MoCo [he2020momentum], ResNet [he2016deep], and SwAV [caron2020unsupervised]. The more the feature distances resemble the ground truth metric distances, the more effective the features will be at encoding a proxy for relative distance between real-world observations. We choose SMoG for feature detection in our high-level manager because its feature distance matrix most closely resembles the ground truth distance heatmap.
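
The comparison described for Figure 4 can be sketched as follows. This is an illustrative example rather than the paper's procedure: the pairwise MSE between feature vectors follows the caption, but scoring the resemblance with a Pearson correlation over the upper triangle is an assumption added here (the caption describes a visual comparison), and the positions and feature vectors are toy values, not data from the paper.

```python
import math

def mse(a, b):
    """Mean squared error between two equal-length feature vectors."""
    return sum((x - y) ** 2 for x, y in zip(a, b)) / len(a)

def pairwise(items, dist):
    """Full pairwise distance matrix under the given distance function."""
    return [[dist(a, b) for b in items] for a in items]

def upper_triangle(m):
    """Flatten the strictly upper-triangular entries of a square matrix."""
    n = len(m)
    return [m[i][j] for i in range(n) for j in range(i + 1, n)]

def pearson(u, v):
    """Pearson correlation between two equal-length sequences."""
    mu, mv = sum(u) / len(u), sum(v) / len(v)
    cov = sum((a - mu) * (b - mv) for a, b in zip(u, v))
    su = math.sqrt(sum((a - mu) ** 2 for a in u))
    sv = math.sqrt(sum((b - mv) ** 2 for b in v))
    return cov / (su * sv)

# Toy trajectory: ground-truth 2D locations of four images.
positions = [(0.0, 0.0), (1.0, 0.0), (2.0, 0.0), (2.0, 1.0)]
# Hypothetical features from two extractors: one tracks position, one does not.
feats_a = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (2.0, 0.0, 0.0), (2.0, 1.0, 0.0)]
feats_b = [(0.0, 1.0, 0.0), (0.0, 1.1, 0.0), (3.0, 0.0, 0.0), (0.1, 1.0, 0.0)]

gt = upper_triangle(pairwise(positions, math.dist))
score_a = pearson(gt, upper_triangle(pairwise(feats_a, mse)))
score_b = pearson(gt, upper_triangle(pairwise(feats_b, mse)))

# The extractor with the higher score better mirrors metric distance.
best = "a" if score_a > score_b else "b"
```

Under this scoring, a feature extractor whose distance matrix correlates more strongly with the ground truth would be preferred as a proxy for relative distance, which mirrors the qualitative argument the caption makes for SMoG.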