Multi-Modal Cognitive Maps based on Neural Networks trained on Successor Representations

Paul Stoewer, Achim Schilling, Andreas Maier, Patrick Krauss

TL;DR

A multi-modal neural network trained on successor representations models place cell dynamics and cognitive map representations, suggesting that large language models like ChatGPT could harness similar architectures to function as a high-level processing center, akin to how the hippocampus operates within the cortex hierarchy.

Abstract

Cognitive maps are a proposed account of how the brain efficiently organizes memories and retrieves context from them. The entorhinal-hippocampal complex is heavily involved in episodic and relational memory processing, as well as spatial navigation, and is thought to build cognitive maps via place and grid cells. To make use of the promising properties of cognitive maps, we set up a multi-modal neural network, trained on successor representations, that is able to model place cell dynamics and cognitive map representations. Here, we use multi-modal inputs consisting of images and word embeddings. The network successfully learns the similarities between novel inputs and the training database, and therefore the representation of the cognitive map. Subsequently, the prediction of the network can be used to infer from one modality to another with over $90\%$ accuracy. The proposed method could therefore be a building block for improving current AI systems, giving them a better understanding of the environment and of the different modalities in which objects appear. The association of specific modalities with certain encounters can lead to context awareness in novel situations: when a similar encounter occurs with less information, the missing information can be inferred from the learned cognitive map. Cognitive maps, as represented by the entorhinal-hippocampal complex in the brain, organize and retrieve context from memories, suggesting that large language models (LLMs) like ChatGPT could harness similar architectures to function as a high-level processing center, akin to how the hippocampus operates within the cortex hierarchy. Finally, by utilizing multi-modal inputs, LLMs can potentially bridge the gap between different forms of data (like images and words), paving the way for context awareness and the grounding of abstract concepts through learned associations, thereby addressing the grounding problem in AI.
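To make the term "successor representation" concrete, the following is a minimal sketch of the standard closed-form definition, in which the SR matrix is the discounted sum of powers of a state-transition probability matrix. It is not taken from the paper: the function name, the discount factor `gamma = 0.9`, and the three-state toy transition matrix are illustrative assumptions, and the authors may use a different horizon or discount.

```python
import numpy as np

def successor_representation(T, gamma=0.9):
    """Closed-form SR for a row-stochastic transition matrix T.

    Uses the textbook identity  M = sum_{t>=0} gamma^t T^t = (I - gamma T)^(-1),
    which holds for 0 <= gamma < 1. (Assumed form, not confirmed by the paper.)
    """
    T = np.asarray(T, dtype=float)
    n = T.shape[0]
    return np.linalg.inv(np.eye(n) - gamma * T)

# Hypothetical toy environment: 3 states, each moving to either
# of the other two states with equal probability.
T = np.array([[0.0, 0.5, 0.5],
              [0.5, 0.0, 0.5],
              [0.5, 0.5, 0.0]])

M = successor_representation(T, gamma=0.9)
print(M)
```

Each row of `M` gives the discounted expected future occupancy of every state when starting from the row's state, which is the kind of similarity structure a network can be trained to predict.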

Paper Structure (18 sections, 7 equations, 7 figures)


Figures (7)

  • Figure 1: A schematic representation of a cognitive map with different kinds of animals. Each animal can be represented by a feature vector holding all available information, such as overall characteristics, location or time of the encounter, visual representation, or any other information worth storing. The objects in the map are connected to all other objects, and the weight of each connection encodes the similarity of the two objects, so similar objects form clusters. The varying map scaling in the entorhinal-hippocampal complex (Collin et al., 2015) makes it possible to zoom in and out of the categories, as in our example from all animals, to mammals, to cats. A cognitive map can therefore provide context information for novel inputs by supplying information stored in the map about similar past encounters.
  • Figure 2: The neural network represented as a graph diagram. The left arm of the network receives the digit images with dimensionality 24x24 as input and propagates them through a 10-layered convolutional neural network. The right arm receives word embeddings with dimensionality 300 as input, which propagate through a 5-layered fully connected multi-layer perceptron. Subsequently, the extracted features (i.e. outputs) of both arms are concatenated and serve as input for a 6-layered fully connected neural network with a final softmax layer for the output. The dimensionality of the softmax layer depends on the number of training samples used.
  • Figure 3: Schematic representation of the interpolation process. The vector $p_1$ predicted by the neural network is multiplied with the training data, called the Memory Matrix $M$, which yields the interpolated memory trace $m_I$.
  • Figure 4: Visualization of Predictive Patterns by Trained Neural Networks on Different Input Modalities. Neural networks trained with 1,000 images are used to generate predictions across various input types, which are then represented as Multidimensional Scaling (MDS) graphs. A: MDS representation for combined input modalities, illustrating distinct clusters corresponding to each digit. B: Predictions based solely on word embeddings, showing clustering for each digit. Unlike other modalities, these clusters are not arranged in a circular order but are more dispersed within the feature space. C: Using only test images as input, the resulting clusters resemble those in A, but are positioned more closely to one another.
  • Figure 5: Training Performance of a Network with 1,000 Training Images. A: Training with the SR-Matrix as the label shows promising accuracy, reaching approximately 40% after 1,000 epochs, alongside a rapid decrease in loss. However, the validation accuracy remains near 0%, and the loss initially increases before plateauing. B: Utilizing the TP-Matrix as the training and testing label yields varying outcomes. The training accuracy and loss trends are similar to those observed in A. Notably, the validation accuracy climbs to about 5%, and the loss shows a continuous decline.
  • ...and 2 more figures
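The interpolation step described in the Figure 3 caption can be sketched as a single vector-matrix product. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes that each row of the Memory Matrix $M$ is the stored feature vector of one training sample and that the prediction $p_1$ has one entry per training sample (consistent with the softmax output described for Figure 2). The function name and the toy values are hypothetical.

```python
import numpy as np

def interpolate_memory(p, memory_matrix):
    """Return the interpolated memory trace m_I = p @ M.

    p             : shape (n_samples,), network prediction over stored samples
    memory_matrix : shape (n_samples, n_features), one stored sample per row
                    (assumed layout, not confirmed by the paper)
    """
    p = np.asarray(p, dtype=float)
    M = np.asarray(memory_matrix, dtype=float)
    if p.shape[0] != M.shape[0]:
        raise ValueError("p needs one entry per row of the memory matrix")
    # Weighted sum of stored rows, with the prediction as weights.
    return p @ M

# Toy memory with 3 stored samples and 4 features each (made-up values).
M = np.array([[1.0, 0.0, 0.0, 0.0],
              [0.0, 1.0, 0.0, 0.0],
              [0.0, 0.0, 1.0, 1.0]])

# Prediction that splits its weight equally between the first two samples.
p_1 = np.array([0.5, 0.5, 0.0])

m_I = interpolate_memory(p_1, M)
print(m_I)
```

Because the result is a weighted average of stored rows, a prediction concentrated on a few similar training samples produces a trace that blends their stored features.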