Deep Reinforcement Learning with Swin Transformers

Li Meng, Morten Goodwin, Anis Yazidi, Paal Engelstad

TL;DR

The paper introduces Swin DQN, an online reinforcement learning algorithm that substitutes CNN backbones with Swin Transformers in Double DQN to leverage local self-attention for high-dimensional visual inputs. By using patch-based tokenization, shifted windowed self-attention, and a hierarchical Swin architecture, it achieves substantial improvements across 49 Atari games, outperforming a CNN-based baseline in maximal and mean returns and improving human-normalized metrics and AUC. The approach demonstrates that spatial attention can meaningfully enhance feature representations in vision-based DRL, albeit at increased computational cost, suggesting directions for lighter Swin variants and broader real-world applications.
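
As a rough illustration of the Double DQN update that Swin DQN builds on, the sketch below computes bootstrapped targets for a batch of transitions: the online network picks the greedy next action and the target network evaluates it. The function name, argument names, and the discount `gamma=0.99` are assumptions made for this example, not values taken from the paper, and the Q-value arrays stand in for the outputs of whichever backbone (CNN or Swin) is used.

```python
import numpy as np

def double_dqn_target(reward, done, q_online_next, q_target_next, gamma=0.99):
    """Compute Double DQN targets for a batch of transitions.

    reward:        (B,) rewards
    done:          (B,) booleans, True where the episode ended
    q_online_next: (B, A) next-state Q-values from the online network
    q_target_next: (B, A) next-state Q-values from the target network
    gamma:         discount factor (hypothetical default, not from the paper)
    """
    reward = np.asarray(reward, dtype=np.float64)
    not_done = 1.0 - np.asarray(done, dtype=np.float64)
    # The online network selects the greedy next action ...
    best_action = np.argmax(q_online_next, axis=1)
    # ... and the target network evaluates that action.
    next_value = np.asarray(q_target_next)[np.arange(len(best_action)), best_action]
    # Terminal transitions do not bootstrap from the next state.
    return reward + gamma * not_done * next_value
```

Decoupling action selection from action evaluation in this way is what distinguishes Double DQN from the original DQN target, which takes the maximum over the target network's own Q-values.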

Abstract

Transformers are neural network models that utilize multiple layers of self-attention heads and have exhibited enormous potential in natural language processing tasks. Meanwhile, there have been efforts to adapt transformers to visual tasks of machine learning, including Vision Transformers and Swin Transformers. Although some researchers use Vision Transformers for reinforcement learning tasks, their experiments remain at a small scale due to the high computational cost. This article presents the first online reinforcement learning scheme that is based on Swin Transformers: Swin DQN. In contrast to existing research, our novel approach demonstrates superior performance in experiments on 49 games in the Arcade Learning Environment. The results show that our approach achieves significantly higher maximal evaluation scores than the baseline method in 45 of the 49 games (92%), and higher mean evaluation scores than the baseline method in 40 of the 49 games (82%).

Paper Structure

This paper contains 6 sections, 4 equations, 5 figures, 4 tables, and 2 algorithms.

Figures (5)

  • Figure 1: The structure of our DQN. It consists of three convolutional layers and a fully connected layer, followed by an output layer.
  • Figure 2: The architecture of our Swin DQN. The top shows the step-by-step procedure. The bottom left box contains structures inside a Swin block. The details of patch merging, window partition and window merging are illustrated on the bottom right.
  • Figure 3: The mean evaluation scores with 95% confidence intervals over 200M frames (50M training steps). The blue line is Swin DQN. The orange line is Double DQN. Evaluations are run every 1M frames (250,000 steps).
  • Figure 4: The performance profiles up to $\tau=8$, showing the percentage of scores larger than $\tau$. The blue line is Swin DQN and the orange line is Double DQN.
  • Figure 5: The activation maps of our Double DQN and Swin DQN. All models are trained for 200M frames. The image background is in grayscale. As the color bar suggests, the more yellow a pixel is, the higher the activation.
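
The window partition and window merging steps named in the Figure 2 caption can be sketched in NumPy as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the function names, the `(H, W, C)` layout, and the example sizes are hypothetical, and the attention mask that Swin Transformers apply to shifted windows is left out.

```python
import numpy as np

def window_partition(x, window_size):
    """Split an (H, W, C) feature map into non-overlapping square windows.

    Returns an array of shape (num_windows, window_size, window_size, C).
    """
    H, W, C = x.shape
    assert H % window_size == 0 and W % window_size == 0
    x = x.reshape(H // window_size, window_size, W // window_size, window_size, C)
    # Bring the two window-grid axes together, then flatten them.
    return x.transpose(0, 2, 1, 3, 4).reshape(-1, window_size, window_size, C)

def window_merge(windows, H, W):
    """Inverse of window_partition: reassemble windows into an (H, W, C) map."""
    ws, C = windows.shape[1], windows.shape[-1]
    x = windows.reshape(H // ws, W // ws, ws, ws, C)
    return x.transpose(0, 2, 1, 3, 4).reshape(H, W, C)

def shifted_window_partition(x, window_size, shift):
    """Cyclically shift the map before partitioning, so that the resulting
    windows straddle the borders of the unshifted windows."""
    shifted = np.roll(x, shift=(-shift, -shift), axis=(0, 1))
    return window_partition(shifted, window_size)

# Example with hypothetical sizes: an 8x8 single-channel map, 4x4 windows.
feature_map = np.arange(8 * 8).reshape(8, 8, 1)
regular = window_partition(feature_map, window_size=4)
shifted = shifted_window_partition(feature_map, window_size=4, shift=2)
restored = window_merge(regular, 8, 8)
```

Self-attention is then computed independently inside each window, which keeps the cost local. Alternating regular and shifted partitions across consecutive layers lets information flow between neighboring windows.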