Unifying Map and Landmark Based Representations for Visual Navigation
Saurabh Gupta, David Fouhey, Sergey Levine, Jitendra Malik
TL;DR
This work tackles robust visual navigation under actuation noise using a unified, learned framework that combines map-based planning with landmark-based execution from sparse views. It introduces four components: a mapper that builds an allocentric map from a limited set of registered images, a learned value-iteration-style path planner, a feature synthesizer that generates visual anchors along planned routes, and a closed-loop controller that follows a path signature $\Xi(p)$, a sequence of triples $(a^p_j, \rho^p_j, \hat{f}^p_j)$, using attention over the trajectory. The approach is trained end-to-end and evaluated in simulated reconstructions of real indoor spaces, where it shows improved navigation performance and robustness over baselines, especially with limited visual input. The framework leverages priors from similar environments and allows differentiable joint optimization across mapping, planning, and execution.
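To make the idea of following a path signature with attention more concrete, here is a minimal sketch. It is not the learned controller summarized above: the field names, the reading of $\rho^p_j$ as a relative pose, the dot-product scoring, the softmax temperature, and the weighted-vote rule are all assumptions chosen for illustration.

```python
import math
from dataclasses import dataclass
from typing import Dict, List, Sequence, Tuple


@dataclass
class SignatureStep:
    """One entry of a path signature (hypothetical field names)."""
    action: str                       # a_j: action planned at step j
    pose: Tuple[float, float, float]  # rho_j: assumed to be a relative pose (x, y, heading)
    feature: Sequence[float]          # f_hat_j: synthesized feature expected at step j


def softmax(scores: List[float]) -> List[float]:
    """Numerically stable softmax over a list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def attend(signature: List[SignatureStep], observed: Sequence[float],
           temperature: float = 1.0) -> List[float]:
    """Soft attention over steps: dot product of observed and expected features."""
    scores = [
        sum(o * f for o, f in zip(observed, step.feature)) / temperature
        for step in signature
    ]
    return softmax(scores)


def pick_action(signature: List[SignatureStep], observed: Sequence[float],
                temperature: float = 1.0) -> str:
    """Return the action with the largest total attention weight."""
    weights = attend(signature, observed, temperature)
    votes: Dict[str, float] = {}
    for w, step in zip(weights, signature):
        votes[step.action] = votes.get(step.action, 0.0) + w
    return max(votes, key=votes.get)


# Toy signature with one-hot features so the attention peak is easy to see.
signature = [
    SignatureStep("forward", (0.0, 0.0, 0.0), [1.0, 0.0, 0.0]),
    SignatureStep("left", (1.0, 0.0, 0.0), [0.0, 1.0, 0.0]),
    SignatureStep("forward", (1.0, 0.0, 1.57), [0.0, 0.0, 1.0]),
]
observed = [0.0, 5.0, 0.0]  # resembles the feature expected at step 1
weights = attend(signature, observed)
chosen = pick_action(signature, observed)
```

In this toy setting the observation matches the second step most closely, so attention concentrates there and the controller picks that step's action. A learned controller would replace the hand-written scoring with trained networks.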
Abstract
This work presents a formulation for visual navigation that unifies map-based spatial reasoning and path planning with landmark-based robust plan execution in noisy environments. Our proposed formulation is learned from data and is thus able to leverage statistical regularities of the world. This allows it to navigate efficiently in novel environments given only a sparse set of registered images as input for building representations of space. Our formulation is based on three key ideas: a learned path planner that outputs path plans to reach the goal, a feature synthesis engine that predicts features for locations along the planned path, and a learned goal-driven closed-loop controller that can follow plans given these synthesized features. We test our approach for goal-driven navigation in simulated real-world environments and report performance gains over competitive baseline approaches.
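For readers unfamiliar with value-iteration-style planning, the following sketch runs classical value iteration on a small, fully known occupancy grid and then reads off a path greedily. It is the textbook algorithm, not the learned planner described here, which operates on a map predicted from images. The grid encoding, unit step cost, and function names are assumptions made for illustration.

```python
from typing import List, Tuple

Cell = Tuple[int, int]
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]  # up, down, left, right


def free_neighbors(grid: List[List[int]], cell: Cell) -> List[Cell]:
    """Free cells (value 0) that are 4-connected to `cell`."""
    rows, cols = len(grid), len(grid[0])
    r, c = cell
    result = []
    for dr, dc in MOVES:
        nr, nc = r + dr, c + dc
        if 0 <= nr < rows and 0 <= nc < cols and grid[nr][nc] == 0:
            result.append((nr, nc))
    return result


def value_iteration(grid: List[List[int]], goal: Cell,
                    max_iters: int = 1000) -> List[List[float]]:
    """Cost-to-go for every cell, assuming unit cost per move (0 = free, 1 = obstacle)."""
    rows, cols = len(grid), len(grid[0])
    cost = [[float("inf")] * cols for _ in range(rows)]
    cost[goal[0]][goal[1]] = 0.0
    for _ in range(max_iters):
        changed = False
        for r in range(rows):
            for c in range(cols):
                if grid[r][c] == 1 or (r, c) == goal:
                    continue
                # Bellman backup: one step plus the best neighbor's cost-to-go.
                best = min(
                    (1.0 + cost[nr][nc] for nr, nc in free_neighbors(grid, (r, c))),
                    default=float("inf"),
                )
                if best < cost[r][c]:
                    cost[r][c] = best
                    changed = True
        if not changed:  # converged
            break
    return cost


def greedy_path(grid: List[List[int]], cost: List[List[float]],
                start: Cell, goal: Cell) -> List[Cell]:
    """Follow strictly decreasing cost-to-go from start to goal."""
    path = [start]
    current = start
    while current != goal:
        options = free_neighbors(grid, current)
        if not options:
            raise ValueError("goal unreachable from start")
        nxt = min(options, key=lambda cell: cost[cell[0]][cell[1]])
        if cost[nxt[0]][nxt[1]] >= cost[current[0]][current[1]]:
            raise ValueError("goal unreachable from start")
        path.append(nxt)
        current = nxt
    return path


# A 3x3 room with a wall that forces a detour around the right side.
grid = [
    [0, 0, 0],
    [1, 1, 0],
    [0, 0, 0],
]
cost = value_iteration(grid, goal=(2, 0))
path = greedy_path(grid, cost, start=(0, 0), goal=(2, 0))
```

Because each backup is built from minimum and addition operations, the same computation can be written with differentiable layers, which is what makes value-iteration-style planners amenable to end-to-end training.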
