Automated mapping of virtual environments with visual predictive coding

James Gornet, Matthew Thomson

TL;DR

This work introduces predictive coding as a unified algorithmic framework for constructing cognitive maps that can naturally extend to the mapping of auditory, sensorimotor, and linguistic inputs.

Abstract

Humans construct internal cognitive maps of their environment directly from sensory inputs without access to a system of explicit coordinates or distance measurements. While machine learning algorithms like SLAM utilize specialized visual inference procedures to identify visual features and construct spatial maps from visual and odometry data, the general nature of cognitive maps in the brain suggests a unified algorithmic mapping strategy that can generalize to auditory, tactile, and linguistic inputs. Here, we demonstrate that predictive coding provides a natural and versatile neural network algorithm for constructing spatial maps using sensory data. We introduce a framework in which an agent navigates a virtual environment while engaging in visual predictive coding using a self-attention-equipped convolutional neural network. While learning a next image prediction task, the agent automatically constructs an internal representation of the environment that quantitatively reflects distances. The internal map enables the agent to pinpoint its location relative to landmarks using only visual information. The predictive coding network generates a vectorized encoding of the environment that supports vector navigation, where individual latent space units delineate localized, overlapping neighborhoods in the environment. Broadly, our work introduces predictive coding as a unified algorithmic framework for constructing cognitive maps that can naturally extend to the mapping of auditory, sensorimotor, and linguistic inputs.
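
The vectorized encoding described above, in which thresholded latent units mark overlapping neighborhoods and the Hamming distance between active-unit patterns tracks physical distance, can be illustrated with a toy example. This is a minimal sketch, not the network from the paper: the circular fields, the grid of `centers`, the `radius`, and the helper names `place_field_code` and `hamming` are all assumptions made for illustration.

```python
import math

def place_field_code(position, centers, radius):
    """Binary code: unit i is active when `position` lies within `radius` of center i."""
    return [1 if math.dist(position, c) <= radius else 0 for c in centers]

def hamming(a, b):
    """Number of units whose active/inactive state differs between two codes."""
    return sum(u != v for u, v in zip(a, b))

# Hypothetical layout: a 5x5 grid of field centers spaced 2 lattice units apart.
centers = [(2 * i, 2 * j) for i in range(5) for j in range(5)]
radius = 2.5  # assumed field radius, chosen so neighboring fields overlap

origin = place_field_code((4, 4), centers, radius)
near = place_field_code((5, 4), centers, radius)  # one lattice unit away
far = place_field_code((8, 8), centers, radius)   # far corner of the grid

# The nearby position shares most active units with the origin; the distant one shares none.
print(hamming(origin, near), hamming(origin, far))
```

In this toy layout the Hamming distance grows as the agent moves away from the origin because fewer fields are shared, which is the qualitative behavior the overlapping-neighborhood description implies.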

Paper Structure (15 sections, 2 theorems, 26 equations, 11 figures)

Table of Contents

  1. Supplementary Information

Key Result

Theorem 1

Consider an environment $X$---a closed subset of the lattice $\mathbb{Z}^2$ with a function $x \xmapsto{f} I$ that gives an image $I_x = f(x) \subset \mathbb{R}^D$ for each position $x \in X$. Let the environment's observations be degenerate such that There exists no decoder $I \xmapsto{d} x$ that satisfies

Figures (11)

  • Figure 1: A predictive coding neural network explores a virtual environment. In predictive coding, a model predicts observations and updates its parameters using the prediction error. a, an agent traverses its environment by taking the most direct path to random positions. b, a self-attention-based encoder-decoder neural network architecture learns to perform predictive coding. A ResNet-18 convolutional neural network acts as an encoder; self-attention is performed with 8 heads, and a corresponding ResNet-18 convolutional neural network decodes the result into the predicted image. c, the neural network learns to perform predictive coding effectively, with a mean-squared error of 0.094 between the actual and predicted images.
  • Figure 2: Predictive coding neural network constructs an implicit spatial map. a-b, the predictive coder's latent space encodes accurate spatial positions. A neural network predicts the spatial location from the predictive coder's latent space. a, a heatmap of the prediction errors between the actual positions and the predictive coder's predicted positions shows a low prediction error. b, the histogram of prediction errors of positions from the predictive coder's latent space shows a low prediction error. As a baseline (noise model, $\sigma = 1$ lattice unit), actual positions with a small noise displacement give an error model. c, predictive coding's latent distances recover the environment's spatial metric. Sequential visual images are mapped to the neural network's latent space, and the latent space distances ($\ell_2$) are plotted against physical distances on a joint density plot. A nonlinear regression model $\left\lVert z - z' \right\rVert = \alpha \log \left\lVert x - x' \right\rVert + \beta$ is shown as a baseline. d, a correlation plot and a quantile-quantile plot show the overlap between the empirical and model distributions.
  • Figure 3: Predictive coding network learns spatial proximity, not image similarity. a, an auto-encoding neural network compresses visual images into a low-dimensional latent vector and reconstructs the image from the latent space. The auto-encoder trains on visual images from the environment without any sequential order. b-c, auto-encoding encodes positional information at a lower resolution. b, a neural network predicts the spatial location from the auto-encoder's latent space. A heatmap of the prediction errors between the actual positions and the auto-encoder's predicted positions shows a higher prediction error than the predictive coder. c, auto-encoding captures less positional information than predictive coding. The histogram shows the prediction errors of positions from the latent space of both the auto-encoder and the predictive coder. d, latent distances, however, show a weaker relationship with physical distances, as the joint histogram between physical and latent distances is less concentrated. e, a correlation plot and a quantile-quantile plot show a lower correlation and a lower density overlap between the empirical and model distributions. f, predictive coding's latent units communicate more fine-grained spatial distances whereas auto-encoding communicates broad spatial regions. Joint density plots show the association between latent distances and physical distances for both predictive coding and auto-encoding. Predictive coding's latent distances increase with spatial distances, with a higher concentration compared to auto-encoding.
  • Figure 4: Predictive coding network can learn a circular topology and distinguish visually identical, spatially different locations. a, an agent traverses a circular environment with two visually identical red rooms, which provide visually similar yet spatially different locations. b, the predictive coder's latent distances show a correspondence with the circular environment's metric while the auto-encoder's latent distances show little correlation. c, similar to Figures 2 and 3, a different neural network measures the predictive coder's spatial information by predicting the agent's location from the predictive coder's latent space. The predictive coder's latent space demonstrates a low prediction error. d, similar to Figures 2 and 3, the nonlinear regression measures the correspondence between the latent distances $\left\lVert z - z' \right\rVert$ and the actual distances $\left\lVert x - x' \right\rVert$ with the model $\left\lVert z - z' \right\rVert = \alpha \log \left\lVert x - x' \right\rVert + \beta$. d, the correlation plot (left) with the nonlinear regression model shows a strong correlation between the predictive coder's latent distances and the environment's actual distances ($r = 0.827$). The quantile-quantile plot (right) between the predictive coder's latent distances and the regression model shows high overlap ($\mathbb{D}_\text{KL}(p_\text{PC} \Vert p_\text{model}) = 0.250$). e, without any past information, the auto-encoder cannot distinguish the two different red rooms and produces a high prediction error in these locations. f, the correlation plot (left) with the nonlinear regression model shows little correlation between the auto-encoder's latent distances and the environment's actual distances ($r = 0.288$). The quantile-quantile plot (right) between the auto-encoder's latent distances and the regression model shows little overlap ($\mathbb{D}_\text{KL}(p_\text{PC} \Vert p_\text{model}) = 3.806$).
  • Figure 5: The predictive coding network generates place fields that support vector-based distance calculations. a, when encoding past images for predictive coding, the self-attention module generates latent vectors. Each continuous unit in these latent vectors activates in concentrated, localized regions in physical space. These continuous units can be thresholded to generate a binary vector determining whether each unit is active. Each latent unit covers a unique region, and each physical location gives a unique combination of these overlapping regions. As an agent moves away from its original location, the combination of overlapping regions gradually deviates from its original combination. This deviation, as measured by Hamming distance, correlates with physical distance. b, distance is given by the difference in the latent units' overlapping regions. Two nearby locations have small deviations in overlap (right) while two distant locations have large deviations (middle). c, latent units are spatially organized into localized regions. The active latent units are approximated by a two-dimensional Gaussian distribution to measure each latent unit's localization (top). The latent units' Gaussian approximations are highly localized, with a mean area of $254.6$ for densities $p \geq 0.0005$. d, latent units are distributed across the environment. The number of latent units was calculated at each lattice block in the environment (left), and the number of lattice blocks was calculated for each active unit (right). The latent units provide a unique combination for $87.6\%$ of the environment, and their aggregate covers the entire environment. e, distance from the region overlap captures most of the predictive coder's spatial information. We calculate the distance for every pair of active latent vectors and their respective physical Euclidean distances as a joint distribution. The proposed mechanism captures a majority of the predictive coder's spatial information, as the proposed mechanism's mutual information (0.542 bits) is comparable to the predictive coder's mutual information (0.627 bits).
  • ...and 6 more figures
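
The regression model named in the Figure 2 and Figure 4 captions, $\left\lVert z - z' \right\rVert = \alpha \log \left\lVert x - x' \right\rVert + \beta$, is linear in $\log \left\lVert x - x' \right\rVert$, so one simple way to estimate $\alpha$ and $\beta$ is ordinary least squares on the log-transformed physical distances. The sketch below takes that route under stated assumptions: the fitting procedure actually used in the paper is not given in this excerpt, the helper name `fit_log_model` is hypothetical, and the data are synthetic values generated from known parameters rather than results from the paper.

```python
import math

def fit_log_model(physical, latent):
    """Least-squares fit of latent = alpha * log(physical) + beta.

    `physical` holds pairwise physical distances (all > 0) and `latent`
    the matching latent distances. Returns (alpha, beta).
    """
    u = [math.log(d) for d in physical]  # regress on log-distance
    n = len(u)
    mean_u = sum(u) / n
    mean_y = sum(latent) / n
    cov = sum((ui - mean_u) * (yi - mean_y) for ui, yi in zip(u, latent))
    var = sum((ui - mean_u) ** 2 for ui in u)
    alpha = cov / var
    beta = mean_y - alpha * mean_u
    return alpha, beta

# Synthetic example: latent distances generated with alpha = 0.5, beta = 1.0.
physical = [1.0, 2.0, 4.0, 8.0, 16.0]
latent = [0.5 * math.log(d) + 1.0 for d in physical]

alpha, beta = fit_log_model(physical, latent)
print(alpha, beta)
```

Because the synthetic data lie exactly on the model curve, the fit recovers the generating parameters up to floating-point error; with real latent distances the residuals would carry the scatter visible in the joint density plots.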

Theorems & Definitions (4)

  • Theorem 1
  • proof
  • Corollary 1
  • proof