Do We Need Transformers to Play FPS Video Games?
Karmanbir Batth, Krish Sethi, Aly Shariff, Leo Shi, Hetul Patel
TL;DR
The paper investigates transformer-based reinforcement learning for FPS gameplay in VizDoom, evaluating online Deep Transformer Q-Networks (DTQN) and offline Decision Transformers (DT) against conventional baselines. DTQN applies a Transformer decoder to sequences of frames to enhance Q-learning under partial observability, while DT treats RL as a supervised sequence generation problem conditioned on return-to-go. Across online and offline settings, results show Transformers do not outperform traditional approaches (DQN+DRQN online, PPO offline) in VizDoom’s memory-intensive tasks, highlighting limitations of self-attention in capturing long-range strategy. The study suggests that future work should explore architectures beyond self-attention, such as selective state-space models (e.g., Decision Mamba), to better handle long-range dependencies in FPS RL and similar environments.
Abstract
In this paper, we explore Transformer-based architectures for reinforcement learning in both online and offline settings within the Doom game environment. Our investigation focuses on two primary approaches: Deep Transformer Q-learning Networks (DTQN) for online learning and Decision Transformers (DT) for offline reinforcement learning. DTQN leverages the sequential modelling capabilities of Transformers to enhance Q-learning in partially observable environments, while Decision Transformers repurpose sequence modelling techniques to enable offline agents to learn from past trajectories without direct interaction with the environment. We conclude that although Transformers may have performed well in Atari games, more traditional methods outperform Transformer-based methods in both settings in the VizDoom environment.
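To make the return-to-go conditioning used by Decision Transformers concrete, the following is a minimal sketch of how a logged trajectory could be turned into a conditioning sequence. The function names `returns_to_go` and `interleave`, the undiscounted default `gamma=1.0`, and the (return-to-go, state, action) token ordering are illustrative assumptions based on the general Decision Transformer formulation, not details confirmed by this paper.

```python
def returns_to_go(rewards, gamma=1.0):
    """Compute R_t = sum over k >= t of gamma^(k - t) * r_k for each timestep t.

    gamma=1.0 (undiscounted) is an assumption made for this sketch.
    """
    rtg = [0.0] * len(rewards)
    running = 0.0
    # Walk the trajectory backwards so each step reuses the suffix sum.
    for t in reversed(range(len(rewards))):
        running = rewards[t] + gamma * running
        rtg[t] = running
    return rtg


def interleave(rtg, states, actions):
    """Build a flat token list ordered (return-to-go, state, action) per timestep.

    The tuple tags are hypothetical placeholders for learned embeddings.
    """
    tokens = []
    for g, s, a in zip(rtg, states, actions):
        tokens.extend([("rtg", g), ("state", s), ("action", a)])
    return tokens


# Toy offline trajectory: three steps with placeholder states and actions.
rewards = [1.0, 0.0, 2.0]
states = ["s0", "s1", "s2"]
actions = ["a0", "a1", "a2"]

rtg = returns_to_go(rewards)
sequence = interleave(rtg, states, actions)
```

In this sketch, a sequence model would be trained with a supervised loss to predict each action token from the tokens before it. At evaluation time the first return-to-go is set to a desired target return and reduced by each reward the agent receives.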
