Spatially-Aware Transformer for Embodied Agents

Junmo Cho, Jaesik Yoon, Sungjin Ahn

TL;DR

The paper addresses a limitation of transformer-based episodic memory, which typically stores experiences in temporal order and lacks explicit spatial context. It introduces Spatially-Aware Transformers (SAT) with place-centric memory and a Hierarchical Read mechanism, together with the Adaptive Memory Allocator (AMA), which selects a memory-writing strategy conditioned on the downstream task. Empirically, SAT and AMA improve spatial reasoning, memory efficiency, and generalization across supervised prediction, action-conditioned image generation, and reinforcement learning tasks, including Room Ballet and FFHQ-based environments. The results indicate that explicit spatial encoding in episodic memory, combined with adaptable memory management, benefits a range of embodied AI tasks. The authors state that source code for the models and experiments will be released.

Abstract

Episodic memory plays a crucial role in various cognitive processes, such as the ability to mentally recall past events. While cognitive science emphasizes the significance of spatial context in the formation and retrieval of episodic memory, the current primary approach to implementing episodic memory in AI systems is through transformers that store temporally ordered experiences, which overlooks the spatial dimension. As a result, it is unclear how the underlying structure could be extended to incorporate the spatial axis beyond temporal order alone and thereby what benefits can be obtained. To address this, this paper explores the use of Spatially-Aware Transformer models that incorporate spatial information. These models enable the creation of place-centric episodic memory that considers both temporal and spatial dimensions. Adopting this approach, we demonstrate that memory utilization efficiency can be improved, leading to enhanced accuracy in various place-centric downstream tasks. Additionally, we propose the Adaptive Memory Allocator, a memory management method based on reinforcement learning that aims to optimize efficiency of memory utilization. Our experiments demonstrate the advantages of our proposed model in various environments and across multiple downstream tasks, including prediction, generation, reasoning, and reinforcement learning. The source code for our models and experiments will be available at https://github.com/junmokane/spatially-aware-transformer.

Paper Structure (38 sections, 2 equations, 22 figures, 5 tables, 3 algorithms)

Figures (22)

  • Figure 1: Home robot thought experiments. In scenario (a), a robot visits the environment in a different temporal order. When it arrives at Room D, answering what happened to the room on its left would not be easy using only the temporal axis. In scenario (b), the robot stays in Room B for a long time. With a FIFO memory, the agent would delete the memory of Room C. When it returns, it will perceive Room C as new.
  • Figure 2: Illustrations of FIFO, Place Memory, and the Adaptive Memory Allocator (AMA). The orange memory is the oldest and originates from a different place compared to the green and blue ones. In FIFO, the orange memory is dropped due to its age. However, the Place Memory system retains it through place-wise memory allocation. AMA chooses the appropriate allocation strategy, $\pi^\tau_{\text{AMA}}$, based on the given downstream task knowledge, denoted as $\tau$.
  • Figure 3: The Room Ballet environment.
  • Figure 4: (a) The performance comparison of transformer memories equipped with different embeddings on the spatial reasoning task. (b) The performance comparison of unstructured memory and temporally/spatially structured memory on a spatial reasoning task in a large environment. (c) The performance comparison of the AMA model and predefined memory allocation strategies on a spatial reasoning task.
  • Figure 5: The accuracy of different task-strategy pairs.
  • ...and 17 more figures
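
The contrast between FIFO and Place Memory described in the captions of Figures 1 and 2 can be illustrated with a small sketch. This is not the authors' implementation: the class names `FIFOMemory` and `PlaceMemory`, the equal per-place split of capacity, and the toy trajectory are all assumptions made for illustration. It only shows why a single FIFO buffer forgets a briefly visited room while a place-wise allocation keeps it.

```python
from collections import defaultdict, deque


class FIFOMemory:
    """Single buffer: the oldest observation is dropped regardless of place."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def write(self, place, obs):
        self.buffer.append((place, obs))

    def places(self):
        return {place for place, _ in self.buffer}


class PlaceMemory:
    """Hypothetical place-wise allocation: each place gets its own small FIFO buffer.

    The equal split of capacity across places is an assumption of this sketch.
    """

    def __init__(self, capacity, num_places):
        self.per_place = capacity // num_places
        self.buffers = defaultdict(lambda: deque(maxlen=self.per_place))

    def write(self, place, obs):
        self.buffers[place].append((place, obs))

    def places(self):
        return {p for p, buf in self.buffers.items() if buf}


# Toy version of scenario (b) in Figure 1: a short visit to Room C,
# followed by a long stay in Room B.
trajectory = [("C", f"c{t}") for t in range(2)] + [("B", f"b{t}") for t in range(10)]

fifo = FIFOMemory(capacity=4)
place = PlaceMemory(capacity=4, num_places=2)
for room, obs in trajectory:
    fifo.write(room, obs)
    place.write(room, obs)

print(sorted(fifo.places()))   # ['B']
print(sorted(place.places()))  # ['B', 'C']
```

With the same total capacity of four slots, the FIFO buffer ends up holding only Room B observations, whereas the place-wise buffers still hold the two Room C observations. AMA, as described in Figure 2, would choose between such allocation strategies depending on the task; that selection step is not modeled here.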