Neural Map: Structured Memory for Deep Reinforcement Learning

Emilio Parisotto, Ruslan Salakhutdinov

TL;DR

The paper tackles memory in partially observable deep reinforcement learning by introducing Neural Map, a spatially structured 2D external memory with location-aligned, sparse writes. It defines differentiable read/write/update operations, including global and context-based reads and a local write, with variants such as key-value addressing and GRU-based writes. Empirical results show Neural Map outperforming LSTM and MemNN baselines in 2D mazes and achieving strong generalization in a 3D Doom environment, especially when using GRU-based updates; an ego-centric extension further removes reliance on absolute pose. The work provides a practical, end-to-end trainable memory architecture that scales to complex 3D environments and offers useful inductive biases for spatial navigation in DRL.

Abstract

A critical component to enabling intelligent reasoning in partially observable environments is memory. Despite this importance, Deep Reinforcement Learning (DRL) agents have so far used relatively simple memory architectures, with the main methods to overcome partial observability being either a temporal convolution over the past k frames or an LSTM layer. More recent work (Oh et al., 2016) has gone beyond these architectures by using memory networks, which allow more sophisticated addressing schemes over the past k frames. But even these architectures are unsatisfactory because they can only remember information from the last k frames. In this paper, we develop a memory system with an adaptable write operator that is customized to the sorts of 3D environments that DRL agents typically interact with. This architecture, called the Neural Map, uses a spatially structured 2D memory image to learn to store arbitrary information about the environment over long time lags. We demonstrate empirically that the Neural Map surpasses previous DRL memories on a set of challenging 2D and 3D maze environments and show that it is capable of generalizing to environments that were not seen during training.
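The global read, context-based read, and location-aligned local write named in the TL;DR can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, the mean-pooling read, the dot-product scoring, and the hand-supplied `query` and `new_vector` are all assumptions standing in for the learned networks the paper trains end to end.

```python
import math

def softmax(scores):
    """Numerically stable softmax over a flat list of scores."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

def global_read(memory):
    """Summarise the whole map into one vector.
    Assumption: mean pooling stands in for a learned read network."""
    cells = [cell for row in memory for cell in row]
    channels = len(cells[0])
    return [sum(cell[c] for cell in cells) / len(cells) for c in range(channels)]

def context_read(memory, query):
    """Soft attention over every map location.
    Assumption: dot-product similarity between the query and each cell."""
    cells = [cell for row in memory for cell in row]
    scores = [sum(q * v for q, v in zip(query, cell)) for cell in cells]
    weights = softmax(scores)  # probability distribution over locations
    channels = len(query)
    context = [sum(w * cell[c] for w, cell in zip(weights, cells))
               for c in range(channels)]
    return context, weights

def local_write(memory, position, new_vector):
    """Return a copy of the map in which only the agent's cell changes.
    Assumption: new_vector is supplied directly rather than computed by a network."""
    x, y = position
    updated = [[list(cell) for cell in row] for row in memory]
    updated[y][x] = list(new_vector)
    return updated

# A 3x3 map with 2 feature channels, initially empty.
memory = [[[0.0, 0.0] for _ in range(3)] for _ in range(3)]
# The agent stands at (x=1, y=2) and stores a feature vector there.
memory = local_write(memory, (1, 2), [1.0, 0.5])
# A later query that resembles the stored vector attends mostly to that cell.
context, weights = context_read(memory, [2.0, 0.0])
summary = global_read(memory)
```

Because the write touches a single cell indexed by the agent's position, information stored at other locations persists for as long as the agent stays away from them, which is the property the abstract contrasts with memories limited to the last k frames.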

Paper Structure

This paper contains 17 sections, 10 equations, 6 figures, 2 tables.

Figures (6)

  • Figure 1: A visualization of two time steps of the neural map.
  • Figure 2: Images showing the 2D maze environments. The left side represents the fully observable maze while the right side represents the agent observations. The agent is represented by the yellow pixel with its orientation indicated by the black arrow within the yellow block. The starting position is always the topmost position of the maze. The red bounding box represents the area of the maze that is subsampled for the agent observation. In "Goal-Search", the goal of the agent is to find a certain color block (either red or teal), where the correct color is provided by an indicator (either green or blue). This indicator has a fixed position near the start position of the agent.
  • Figure 3: Training curves for all 4 agent architectures on the "Goal-Search" environment. The x-axis is an epoch (250k concurrent steps) and the y-axis is the average undiscounted episode return. The curves show that the GRU-based Neural Map learns faster and is more stable than the standard update Neural Map.
  • Figure 4: A few sampled states from an example episode demonstrating how the agent learns to use the context addressing operation of the Neural Map. The top row shows the observations made by the agent, the center row shows the fully observable mazes, and the bottom row shows the probability distribution over locations induced by the context operation at that step.
  • Figure 5: Images of the Doom maze environment. The agent starts in the middle of a maze looking in the direction of a torch indicator. The torch can be either green (top-left image) or red (bottom-left image) and indicates which of the goals to search for. The goals are two towers which are randomly located within the maze and match the indicator color. The episode ends whenever the agent touches a tower, at which point it receives a positive reward if it reached the correct tower and a negative reward otherwise. The episode also terminates if the agent has not reached a tower within 420 steps.
  • ...and 1 more figure