Deep Transformer Q-Networks for Partially Observable Reinforcement Learning
Kevin Esslinger, Robert Platt, Christopher Amato
TL;DR
The paper tackles partial observability in reinforcement learning by introducing DTQN, a Q-network built on a transformer decoder that encodes an agent's history through self-attention with learned positional encodings. It trains with an intermediate Q-value prediction objective, so that Q-values predicted at every timestep of the history provide supervision, which improves learning stability. Across multiple partially observable domains, DTQN matches or outperforms strong baselines (DRQN, DQN, ATTN), learning faster and reaching higher final performance, and its attention weights can be visualized for interpretation. The work demonstrates the viability of transformer-based history models for partial observability and provides a modular implementation to serve as a benchmark for future transformer-based RL methods.
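To make the intermediate Q-value prediction objective concrete, the following is a minimal sketch in plain Python, not the authors' implementation. It assumes a one-step TD target with bootstrap values supplied as if by a target network; the function names, the `gamma` default, and the list-based data layout are all hypothetical. It contrasts a loss averaged over every timestep in the history with a loss that uses only the final timestep.

```python
def td_targets(rewards, dones, next_q_values, gamma=0.99):
    """One-step TD target for every timestep in the history.

    next_q_values[t] holds the Q-values of the successor of step t
    (assumed here to come from a separate target network).
    """
    targets = []
    for reward, done, next_q in zip(rewards, dones, next_q_values):
        # No bootstrapping past a terminal transition.
        bootstrap = 0.0 if done else gamma * max(next_q)
        targets.append(reward + bootstrap)
    return targets


def intermediate_q_loss(q_values, actions, rewards, dones, next_q_values, gamma=0.99):
    """Mean squared TD error over ALL timesteps in the history."""
    targets = td_targets(rewards, dones, next_q_values, gamma)
    errors = [(q[a] - y) ** 2 for q, a, y in zip(q_values, actions, targets)]
    return sum(errors) / len(errors)


def last_step_q_loss(q_values, actions, rewards, dones, next_q_values, gamma=0.99):
    """Squared TD error at the final timestep only, for comparison."""
    targets = td_targets(rewards, dones, next_q_values, gamma)
    return (q_values[-1][actions[-1]] - targets[-1]) ** 2
```

With a history of length k, the first function yields k error terms per sampled history instead of one, which is the sense in which every timestep contributes a training signal in this sketch.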
Abstract
Real-world reinforcement learning tasks often involve some form of partial observability where the observations only give a partial or noisy view of the true state of the world. Such tasks typically require some form of memory, where the agent has access to multiple past observations, in order to perform well. One popular way to incorporate memory is by using a recurrent neural network to access the agent's history. However, recurrent neural networks in reinforcement learning are often fragile and difficult to train, are susceptible to catastrophic forgetting, and sometimes fail completely as a result. In this work, we propose Deep Transformer Q-Networks (DTQN), a novel architecture utilizing transformers and self-attention to encode an agent's history. DTQN is designed modularly, and we compare results against several modifications to our base model. Our experiments demonstrate that the transformer can solve partially observable tasks faster and more stably than previous recurrent approaches.
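As a rough illustration of how self-attention can encode a history of observations, here is a minimal single-head sketch in plain Python. It is a simplification under stated assumptions, not the DTQN architecture: the query, key, and value projections are taken to be the identity, positional encodings and feed-forward layers are omitted, and a causal mask is assumed so that each timestep attends only to itself and earlier steps. The function name `causal_self_attention` is hypothetical.

```python
import math


def causal_self_attention(embeddings):
    """Single-head scaled dot-product attention over a history.

    embeddings: list of per-timestep vectors (lists of floats).
    Identity Q/K/V projections are an assumption made for brevity.
    Step t attends only to steps 0..t (causal mask).
    """
    dim = len(embeddings[0])
    outputs = []
    for t, query in enumerate(embeddings):
        visible = embeddings[: t + 1]
        # Scaled dot-product scores against the current and earlier steps.
        scores = [
            sum(q * k for q, k in zip(query, key)) / math.sqrt(dim)
            for key in visible
        ]
        # Numerically stable softmax over the visible steps.
        peak = max(scores)
        weights = [math.exp(s - peak) for s in scores]
        total = sum(weights)
        weights = [w / total for w in weights]
        # Weighted sum of the visible embeddings.
        outputs.append(
            [sum(w * v[i] for w, v in zip(weights, visible)) for i in range(dim)]
        )
    return outputs
```

Because of the mask, the output at step t is unaffected by later observations, so each position in the history can produce its own prediction from the information available at that time.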
