Mind the GAP! The Challenges of Scale in Pixel-based Deep Reinforcement Learning

Ghada Sokar, Pablo Samuel Castro

TL;DR

This work tackles the challenge of scaling pixel-based deep reinforcement learning by identifying the bottleneck between the encoder output $\phi(x)$ and the dense head $\psi$ as the main limiting factor. It shows that prior scaling techniques (SoftMoE, tokenization, and pruning) primarily act by restructuring this bottleneck rather than by fundamentally improving representation learning. The authors propose Global Average Pooling (GAP) as a simple, efficient intervention that directly targets the bottleneck, achieving strong performance across scales, architectures, and data regimes while reducing computational cost relative to more complex methods such as SoftMoE. They validate GAP's generality by demonstrating consistent gains on Atari ALE, Procgen, Atari100K with DER, and SAC on the DMC suite, pointing to a practical path toward scalable pixel-based RL and inviting further exploration of bottleneck-aware representation design.

Abstract

Scaling deep reinforcement learning in pixel-based environments presents a significant challenge, often resulting in diminished performance. While recent works have proposed algorithmic and architectural approaches to address this, the underlying cause of the performance drop remains unclear. In this paper, we identify the connection between the output of the encoder (a stack of convolutional layers) and the ensuing dense layers as the main underlying factor limiting scaling capabilities; we denote this connection as the bottleneck, and we demonstrate that previous approaches implicitly target this bottleneck. As a result of our analyses, we present global average pooling as a simple yet effective way of targeting the bottleneck, thereby avoiding the complexity of earlier approaches.
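The bottleneck described in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the feature-map shape `(H, W, C) = (11, 11, 32)`, the hidden width `512`, and the function names are all assumptions chosen for illustration, and bias terms and nonlinearities are omitted.

```python
import numpy as np


def flatten_bottleneck(features):
    """Baseline: flatten the (H, W, C) encoder output into one H*W*C vector."""
    return features.reshape(-1)


def gap_bottleneck(features):
    """Global average pooling: average over H and W, leaving C values."""
    return features.mean(axis=(0, 1))


rng = np.random.default_rng(0)

# Hypothetical sizes, not taken from the paper.
H, W, C, hidden = 11, 11, 32, 512
features = rng.standard_normal((H, W, C))  # stand-in for the encoder output

flat = flatten_bottleneck(features)   # shape (H*W*C,)
pooled = gap_bottleneck(features)     # shape (C,)

# First dense layer after the bottleneck (weights only).
w_flat = rng.standard_normal((flat.shape[0], hidden))
w_gap = rng.standard_normal((pooled.shape[0], hidden))

h_flat = flat @ w_flat    # baseline path into the dense layers
h_gap = pooled @ w_gap    # GAP path into the dense layers
```

Both paths produce a vector of the same hidden width; the difference lies only in how many weights connect the encoder output to the first dense layer.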

Paper Structure

This paper contains 41 sections, 1 equation, 18 figures, 1 table.

Figures (18)

  • Figure 1: Illustration of the bottleneck in pixel-based networks. Standard dense networks (Baseline) connect all $\phi$ outputs with $\psi$, resulting in $H\times W\times C\times \dim(\psi)$ parameters (scaled down when using pruning, shown in red). SoftMoE-1 converts $\phi$'s outputs into $H\times W$ tokens of dimension $C$; the sharing of learned parameters across tokens results in a bottleneck with $C\times \dim(\psi)$ parameters. GAP performs average pooling across the $H\times W$ spatial dimensions, resulting in $C$ feature maps and $C\times \dim(\psi)$ parameters in the bottleneck.
  • Figure 2: (Left) Distribution of dormant neurons across $\phi$ and $\psi$ in the scaled baseline across different games at the end of training. The fully connected layer exhibits the highest percentage of dormancy. (Right) The performance degradation associated with scaling the entire network architecture is comparable to that observed when only the bottleneck is scaled. Performance is aggregated over 20 games.
  • Figure 3: GAP helps improve attention to relevant areas of the input. Visualizing influential regions for network decisions using Grad-CAM (Selvaraju et al., 2017). (Left) The scaled baseline fails to attend to the important regions, focusing on irrelevant background details. (Right) GAP attends to the important regions in the input.
  • Figure 4: Scaling $\psi$ hinders learning effective combinations of the encoder's features, leading to a significant performance drop. However, performance dramatically improves when the scaled $\psi$ is fed with higher-level, more abstract features obtained by increasing the depth of $\phi$.
  • Figure 5: (Top) Across different sparse algorithms, sparsifying only $\psi$ yields better performance than sparsifying both $\phi$ and $\psi$. (Bottom) The relation between performance and the effective number of parameters in $\psi$ for different approaches. Architectural methods have lower effective density than the baseline, which correlates with the observed performance improvements.
  • ...and 13 more figures
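
The bottleneck parameter counts quoted in the Figure 1 caption can be checked with a short calculation. The sketch below uses hypothetical sizes (`H = W = 11`, `C = 32`, `dim_psi = 512`) that are not taken from the paper, counts weights only (no biases), and uses illustrative function names.

```python
def baseline_bottleneck_params(h, w, c, dim_psi):
    """Dense baseline: every one of the H*W*C outputs connects to psi."""
    return h * w * c * dim_psi


def shared_bottleneck_params(c, dim_psi):
    """SoftMoE-1 (weights shared across tokens) and GAP: C inputs to psi."""
    return c * dim_psi


# Hypothetical sizes, chosen only to make the ratio concrete.
H, W, C, DIM_PSI = 11, 11, 32, 512

baseline = baseline_bottleneck_params(H, W, C, DIM_PSI)
shared = shared_bottleneck_params(C, DIM_PSI)

# The reduction factor equals the number of spatial positions, H * W.
reduction = baseline // shared

print(f"baseline bottleneck weights: {baseline}")
print(f"SoftMoE-1 / GAP bottleneck weights: {shared}")
print(f"reduction factor: {reduction}")
```

Under the caption's formulas, the reduction factor is always $H\times W$, independent of $C$ and $\dim(\psi)$.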