Cognitive Mapping and Planning for Visual Navigation

Saurabh Gupta, Varun Tolani, James Davidson, Sergey Levine, Rahul Sukthankar, Jitendra Malik

TL;DR

The paper tackles visual navigation in unseen indoor environments by introducing CMP, an end-to-end architecture that jointly learns a latent spatial memory (mapping) and a differentiable planning module. By operating on a multi-scale egocentric belief and using a trainable value-iteration-based planner, CMP handles partial observability and long-horizon planning more robustly than prior reactive or isolated-mapping approaches. Through extensive simulations on real-building scans and a real-world TurtleBot deployment, CMP shows superior performance to baselines across geometric and semantic goals, with notable gains when augmented with larger, diverse training environments. The work demonstrates promising sim-to-real transfer and highlights the importance of integrated mapping-planning with learned representations for scalable, goal-directed navigation.

Abstract

We introduce a neural architecture for navigation in novel environments. Our proposed architecture learns to map from first-person views and plans a sequence of actions towards goals in the environment. The Cognitive Mapper and Planner (CMP) is based on two key ideas: a) a unified joint architecture for mapping and planning, such that the mapping is driven by the needs of the task, and b) a spatial memory with the ability to plan given an incomplete set of observations about the world. CMP constructs a top-down belief map of the world and applies a differentiable neural net planner to produce the next action at each time step. The accumulated belief of the world enables the agent to track visited regions of the environment. We train and test CMP on navigation problems in simulation environments derived from scans of real-world buildings. Our experiments demonstrate that, in many cases, CMP outperforms alternate learning-based architectures as well as classical mapping and path planning approaches. Furthermore, it naturally extends to semantically specified goals, such as 'going to a chair'. We also deploy CMP on physical robots in indoor environments, where it achieves reasonable performance, even though it is trained entirely in simulation.

Paper Structure

This paper contains 18 sections, 5 equations, 19 figures, and 5 tables.

Figures (19)

  • Figure 1: Top: Network architecture: Our learned navigation network consists of mapping and planning modules. The mapper writes into a latent spatial memory that corresponds to an egocentric map of the environment, while the planner uses this memory alongside the goal to output navigational actions. The map is not supervised explicitly, but rather emerges naturally from the learning process. Bottom: We also describe experiments where we deploy our learned navigation policies on a physical robot.
  • Figure 2: Architecture of the mapper: The mapper module processes first-person images from the robot and integrates the observations into a latent memory, which corresponds to an egocentric top-view map of the environment. The mapping operation is not supervised explicitly; the mapper is free to write into memory whatever information is most useful for the planner. In addition to filling in obstacles, the mapper also stores confidence values in the map, which allows it to make probabilistic predictions about unobserved parts of the map by exploiting learned patterns.
  • Figure 3: Architecture of the hierarchical planner: The hierarchical planner takes the egocentric multi-scale belief of the world output by the mapper and uses value iteration, expressed as convolutions and channel-wise max-pooling, to output a policy. The planner is trainable and differentiable, and it back-propagates gradients to the mapper. The planner operates at multiple scales of the problem (scale 0 is the finest), which makes planning more efficient.
  • Figure 4: Geometric Task: We plot the mean distance to goal and the 75th percentile distance to goal (lower is better for both) and the success rate (higher is better) as a function of the number of steps. The top row compares the 4-frame reactive agent, the LSTM-based agent, and our proposed CMP-based agent when using RGB images as input (left three plots) and when using depth images as input (right three plots). The bottom row compares classical mapping and planning with CMP (again, left is with RGB input and right is with depth input). We note that CMP outperforms all these baselines, and that depth input leads to better performance than RGB input.
  • Figure 5: Semantic Task: We plot the success rate as a function of the number of steps for different categories. The top row compares learning-based approaches (the 4-frame reactive agent, the LSTM-based agent, and our proposed CMP-based agent). The bottom row compares a classical approach (exploration combined with semantic segmentation) and CMP. Left plots show performance with RGB input; right plots show performance with depth input. See text for more details.
  • ...and 14 more figures
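
The Figure 3 caption describes value iteration expressed as convolutions and channel-wise max-pooling. The following is a minimal, dependency-free sketch of that idea on a single-scale grid, not the paper's planner. It assumes a fully known occupancy grid, four hand-set one-hot 3x3 kernels (one per move) in place of learned ones, a terminal reward of 1 at the goal, and a discount of 0.9. The names `value_iteration`, `greedy_action`, `conv2d_same`, `ACTIONS`, and `GAMMA` are hypothetical and chosen only for illustration.

```python
# Illustrative sketch only: fixed kernels, single scale, known map.
# A learned planner would train its kernels and plan over several scales.

GAMMA = 0.9  # assumed discount factor

# One action channel per move; each offset is (row delta, column delta).
ACTIONS = {"up": (-1, 0), "down": (1, 0), "left": (0, -1), "right": (0, 1)}


def make_kernel(dr, dc):
    """One-hot 3x3 kernel that selects the neighbour at offset (dr, dc)."""
    kernel = [[0.0] * 3 for _ in range(3)]
    kernel[1 + dr][1 + dc] = 1.0
    return kernel


KERNELS = {name: make_kernel(dr, dc) for name, (dr, dc) in ACTIONS.items()}


def conv2d_same(grid, kernel):
    """3x3 cross-correlation with zero padding, output same size as input."""
    rows, cols = len(grid), len(grid[0])
    out = [[0.0] * cols for _ in range(rows)]
    for r in range(rows):
        for c in range(cols):
            total = 0.0
            for i in range(3):
                for j in range(3):
                    rr, cc = r + i - 1, c + j - 1
                    if 0 <= rr < rows and 0 <= cc < cols:
                        total += kernel[i][j] * grid[rr][cc]
            out[r][c] = total
    return out


def value_iteration(free, goal, num_iters):
    """Return a value map; free[r][c] is True where the cell is traversable."""
    rows, cols = len(free), len(free[0])
    value = [[0.0] * cols for _ in range(rows)]
    for _ in range(num_iters):
        # Convolution step: one Q channel per action.
        q = [conv2d_same(value, kernel) for kernel in KERNELS.values()]
        new_value = [[0.0] * cols for _ in range(rows)]
        for r in range(rows):
            for c in range(cols):
                if not free[r][c]:
                    continue  # obstacles keep value 0
                # Channel-wise max over the action channels.
                best = max(channel[r][c] for channel in q)
                reward = 1.0 if (r, c) == goal else 0.0
                new_value[r][c] = max(reward, GAMMA * best)
        value = new_value
    return value


def greedy_action(value, cell):
    """Pick the action whose Q channel is largest at the given cell."""
    r, c = cell
    scores = {name: conv2d_same(value, k)[r][c] for name, k in KERNELS.items()}
    return max(scores, key=scores.get)
```

Under these assumptions, a free cell whose shortest obstacle-free path to the goal has d steps ends up with value 0.9^d once the number of iterations exceeds d, so following `greedy_action` from any reachable cell traces a shortest path. Each iteration uses only convolution, a max over channels, and elementwise operations, which is what lets a planner of this kind pass gradients back to whatever produces its input map.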