Spatially-Aware Transformer for Embodied Agents
Junmo Cho, Jaesik Yoon, Sungjin Ahn
TL;DR
Transformer-based episodic memory typically stores experiences in temporal order and lacks explicit spatial context. This paper introduces the Spatially-Aware Transformer (SAT), which combines place-centric memory with a Hierarchical Read mechanism, together with the Adaptive Memory Allocator (AMA), which selects a memory-writing strategy conditioned on the task goal. Empirically, SAT and AMA improve spatial reasoning, memory efficiency, and generalization across supervised prediction, action-conditioned image generation, and reinforcement learning tasks, in environments including Room Ballet and FFHQ-based settings. The results indicate that explicit spatial encoding in episodic memory, combined with adaptable memory management, yields robust performance across diverse embodied AI tasks. The authors also provide code and detailed experimental settings for further study.
Abstract
Episodic memory plays a crucial role in various cognitive processes, such as the ability to mentally recall past events. While cognitive science emphasizes the significance of spatial context in the formation and retrieval of episodic memory, the current primary approach to implementing episodic memory in AI systems is through transformers that store temporally ordered experiences, which overlooks the spatial dimension. As a result, it is unclear how the underlying structure could be extended to incorporate the spatial axis beyond temporal order alone, and what benefits this would bring. To address this, this paper explores the use of Spatially-Aware Transformer models that incorporate spatial information. These models enable the creation of place-centric episodic memory that considers both temporal and spatial dimensions. Adopting this approach, we demonstrate that memory utilization efficiency can be improved, leading to enhanced accuracy in various place-centric downstream tasks. Additionally, we propose the Adaptive Memory Allocator, a memory management method based on reinforcement learning that aims to optimize the efficiency of memory utilization. Our experiments demonstrate the advantages of our proposed model in various environments and across multiple downstream tasks, including prediction, generation, reasoning, and reinforcement learning. The source code for our models and experiments will be available at https://github.com/junmokane/spatially-aware-transformer.
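
To make the contrast between time-ordered and place-centric memory concrete, here is a minimal Python sketch. It is not the authors' implementation and involves no transformer; the class names, the room labels, the capacities, and the per-place first-in-first-out eviction rule are all assumptions chosen for illustration. It only shows why, under the same total budget, grouping entries by place can preserve information that a purely temporal buffer would discard.

```python
from collections import defaultdict, deque


class FlatFIFOMemory:
    """Time-ordered memory (illustrative): evicts the oldest entry overall."""

    def __init__(self, capacity):
        self.buffer = deque(maxlen=capacity)

    def write(self, t, place, obs):
        self.buffer.append((t, place, obs))

    def read(self, place):
        return [obs for (_, p, obs) in self.buffer if p == place]


class PlaceCentricMemory:
    """Place-indexed memory (illustrative): one small FIFO bucket per place."""

    def __init__(self, capacity_per_place):
        self.buckets = defaultdict(lambda: deque(maxlen=capacity_per_place))

    def write(self, t, place, obs):
        self.buckets[place].append((t, obs))

    def read(self, place):
        # Check membership first so a lookup never creates an empty bucket.
        if place not in self.buckets:
            return []
        return [obs for (_, obs) in self.buckets[place]]


# Hypothetical trajectory: one visit to room_A, then eight steps in room_B.
trajectory = [(0, "room_A", "red key")]
trajectory += [(t, "room_B", f"frame {t}") for t in range(1, 9)]

# Same total budget here: 4 slots flat vs. 2 slots for each of 2 places.
flat = FlatFIFOMemory(capacity=4)
placed = PlaceCentricMemory(capacity_per_place=2)
for t, place, obs in trajectory:
    flat.write(t, place, obs)
    placed.write(t, place, obs)

flat_A = flat.read("room_A")      # room_A was pushed out by later room_B frames
placed_A = placed.read("room_A")  # room_A keeps its own bucket
placed_B = placed.read("room_B")  # only the two most recent room_B frames remain
print(flat_A, placed_A, placed_B)
```

In this toy setting the flat buffer no longer holds anything from room_A once the agent has spent enough steps elsewhere, while the place-indexed buffer still does. Which eviction rule suits a given task is exactly the kind of choice the abstract assigns to the Adaptive Memory Allocator; the fixed rule above is only a stand-in.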
