Table of Contents
Fetching ...

Unifying Map and Landmark Based Representations for Visual Navigation

Saurabh Gupta, David Fouhey, Sergey Levine, Jitendra Malik

TL;DR

This work tackles robust visual navigation under actuation noise using a unified, learned framework that combines map-based planning with landmark-based execution from sparse views. It introduces four components: a mapper that builds an allocentric map from limited registered images, a learned value-iteration style path planner, a feature synthesizer that generates visual anchors along planned routes, and a closed-loop controller that follows a path signature $Ξ(p)$ comprising $(a^p_j, ρ^p_j, \hat{f^p_j})$ with attention over the trajectory. The approach is trained end-to-end and demonstrated in simulated reconstructions of real indoor spaces, showing improved navigation performance and robustness over baselines, especially with limited visual input. This framework leverages priors from similar environments and enables differentiable joint optimization across mapping, planning, and execution, offering a scalable path toward more reliable autonomous navigation in real-world settings.

Abstract

This works presents a formulation for visual navigation that unifies map based spatial reasoning and path planning, with landmark based robust plan execution in noisy environments. Our proposed formulation is learned from data and is thus able to leverage statistical regularities of the world. This allows it to efficiently navigate in novel environments given only a sparse set of registered images as input for building representations for space. Our formulation is based on three key ideas: a learned path planner that outputs path plans to reach the goal, a feature synthesis engine that predicts features for locations along the planned path, and a learned goal-driven closed loop controller that can follow plans given these synthesized features. We test our approach for goal-driven navigation in simulated real world environments and report performance gains over competitive baseline approaches.

Unifying Map and Landmark Based Representations for Visual Navigation

TL;DR

This work tackles robust visual navigation under actuation noise using a unified, learned framework that combines map-based planning with landmark-based execution from sparse views. It introduces four components: a mapper that builds an allocentric map from limited registered images, a learned value-iteration style path planner, a feature synthesizer that generates visual anchors along planned routes, and a closed-loop controller that follows a path signature comprising with attention over the trajectory. The approach is trained end-to-end and demonstrated in simulated reconstructions of real indoor spaces, showing improved navigation performance and robustness over baselines, especially with limited visual input. This framework leverages priors from similar environments and enables differentiable joint optimization across mapping, planning, and execution, offering a scalable path toward more reliable autonomous navigation in real-world settings.

Abstract

This works presents a formulation for visual navigation that unifies map based spatial reasoning and path planning, with landmark based robust plan execution in noisy environments. Our proposed formulation is learned from data and is thus able to leverage statistical regularities of the world. This allows it to efficiently navigate in novel environments given only a sparse set of registered images as input for building representations for space. Our formulation is based on three key ideas: a learned path planner that outputs path plans to reach the goal, a feature synthesis engine that predicts features for locations along the planned path, and a learned goal-driven closed loop controller that can follow plans given these synthesized features. We test our approach for goal-driven navigation in simulated real world environments and report performance gains over competitive baseline approaches.

Paper Structure

This paper contains 22 sections, 5 equations, 8 figures, 2 tables.

Figures (8)

  • Figure 1: Problem Setup: Given a set of reference images and poses, the starting image and pose, and a goal pose, we want a policy $\pi$ that is able to convey the robot from its current pose to the target pose using first person RGB image observations under noisy actuation.
  • Figure 2: Overview: Given a small set of registered RGB images as input, our approach for visual navigation conveys the robot to the target location using two central components: a joint mapping and planning module that learns to plan paths and find shortcuts given the sparse set of registered views; and a joint path signature and execution module that executes a given path.
  • Figure 3: Mapping Architecture: Given an input image and pose $\theta_i$, our mapping architecture passes it through a neural network composed of $\psi$, $g$, and $\phi$, producing an egocentric map $m^{ego}$ from the camera view. This map is transformed according to $\theta_i$ using a spatial transformer to produce an allocentric map $m^{allo}$ in the world frame. The allocentric maps are accumulated by $\Sigma_m$ and upconvolved by $U$ to produce the final map.
  • Figure 4: Feature Synthesis and Policy Execution: Given a path, we produce a signature consisting of a sequence of tuples $(a_i, \rho_i, \hat{f}_i)$., These tuples are softly attended to by a path executing policy $\pi$ which in sequence determines what action to take.
  • Figure 5: Results: Far left: Average Precision for free space prediction as a function of the number of mapping images when the mapper module is trained in isolation for the task of free space prediction. Center left: Success rate for getting to the goal using the joint mapper and planner architecture to output open loop plans varying number of registered images as input. Center right: Average precision for localization of novel locations in space given varying number of reference images as input. Far right: Distribution of distance to goal after for following a plan to get to a goal that is 32 to 36 time steps away with reference images coming randomly from around the trajectory. Best viewed in color, see relevant text for details.
  • ...and 3 more figures