Mixtures of Experts Unlock Parameter Scaling for Deep RL

Johan Obando-Ceron, Ghada Sokar, Timon Willi, Clare Lyle, Jesse Farebrother, Jakob Foerster, Gintare Karolina Dziugaite, Doina Precup, Pablo Samuel Castro

TL;DR

This paper demonstrates that integrating Soft Mixtures of Experts (Soft MoEs) into value-based deep RL networks significantly improves parameter scalability, enabling larger models to perform better without destabilizing training. By replacing the penultimate layer with a Soft MoE, the authors observe consistent gains for DQN and Rainbow on extensive Atari benchmarks; the gains grow with the number of experts and persist at high replay ratios. They provide in-depth analyses of tokenization, gating, and encoder choices, showing that learned routing and the accompanying gating/combining components drive the improvements, and that MoEs also stabilize optimization, as evidenced by NTK rank and fewer dormant neurons. The work extends beyond online RL by showing promise in offline RL and low-data regimes, highlighting MoEs as a practical route toward establishing parameter scaling laws in reinforcement learning.

Abstract

The recent rapid progress in (self) supervised learning models is in large part predicted by empirical scaling laws: a model's performance scales proportionally to its size. Analogous scaling laws remain elusive for reinforcement learning domains, however, where increasing the parameter count of a model often hurts its final performance. In this paper, we demonstrate that incorporating Mixture-of-Expert (MoE) modules, and in particular Soft MoEs (Puigcerver et al., 2023), into value-based networks results in more parameter-scalable models, evidenced by substantial performance increases across a variety of training regimes and model sizes. This work thus provides strong empirical evidence towards developing scaling laws for reinforcement learning.
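
To make the Soft MoE idea concrete, here is a minimal NumPy sketch of a single Soft MoE forward pass in the style of Puigcerver et al. (2023). It is an illustration under stated assumptions, not the authors' implementation: the function name `soft_moe`, the single ReLU-linear expert, and all sizes are hypothetical choices made for this example. Every slot receives a softmax-weighted average of all input tokens (dispatch), each expert processes its own slots, and every output token is a softmax-weighted average of all slot outputs (combine).

```python
import numpy as np


def softmax(x, axis):
    """Numerically stable softmax along the given axis."""
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def soft_moe(tokens, phi, expert_weights):
    """One Soft MoE forward pass (illustrative sketch).

    tokens:         (n, d)    input tokens
    phi:            (d, E*p)  learned slot parameters, p slots per expert
    expert_weights: (E, d, d) one weight matrix per expert
                    (assumption: each expert is a single ReLU-linear map)
    """
    num_experts = expert_weights.shape[0]
    dim = tokens.shape[1]

    logits = tokens @ phi                   # (n, E*p)
    dispatch = softmax(logits, axis=0)      # normalise over tokens, per slot
    combine = softmax(logits, axis=1)       # normalise over slots, per token

    # Each slot is a convex combination of all input tokens.
    slots = dispatch.T @ tokens             # (E*p, d)
    slots = slots.reshape(num_experts, -1, dim)  # (E, p, d)

    # Expert e processes only its own p slots.
    outs = np.einsum("epd,edk->epk", slots, expert_weights)
    outs = np.maximum(outs, 0.0).reshape(-1, dim)  # (E*p, d)

    # Each output token is a convex combination of all slot outputs.
    return combine @ outs, dispatch, combine


rng = np.random.default_rng(0)
n, d, E, p = 6, 4, 3, 2  # hypothetical sizes
tokens = rng.normal(size=(n, d))
phi = rng.normal(size=(d, E * p))
experts = rng.normal(size=(E, d, d))
out, dispatch, combine = soft_moe(tokens, phi, experts)
print(out.shape)
```

Because both weight matrices come from softmaxes rather than a hard top-k selection, every token contributes to every slot and the whole layer stays differentiable, which is the property that distinguishes Soft MoE from the Top1-MoE variant.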

Paper Structure (33 sections, 2 equations, 20 figures, 4 tables)

Figures (20)

  • Figure 1: The use of Mixture of Experts allows the performance of DQN (top) and Rainbow (bottom) to scale with an increased number of parameters. While $\textrm{Soft MoE}$ helps in both cases and improves with scale, $\textrm{Top1-MoE}$ only helps in Rainbow, and does not improve with scale. The corresponding layer in the baseline is scaled by the number of experts to (approximately) match parameters. IQM scores computed over 200M environment steps over 20 games, with 5 independent runs each, and error bars showing 95% stratified bootstrap confidence intervals. The replay ratio is fixed to the standard $0.25$.
  • Figure 2: Incorporating MoE modules into deep RL networks. Top left: Baseline architecture; bottom left: Baseline with penultimate layer scaled up; right: Penultimate layer replaced with an MoE module.
  • Figure 3: Tokenization types considered: PerConv (per convolution), PerFeat (per feature), and PerSamp (per sample).
  • Figure 4: $\textrm{Soft MoE}$ yields performance gains even at high replay ratio values. DQN (left) and Rainbow (right) with 8 experts. See the setup section for training details.
  • Figure 5: Scaling down the dimensionality of $\textrm{Soft MoE}$ experts has no significant impact on performance in Rainbow. See the setup section for training details.
  • ...and 15 more figures
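
The three tokenization types named in the Figure 3 caption can be sketched as reshapes of the convolutional encoder output. The captions above give only the names, so the shapes below are an assumption: the encoder output is taken to be an `(h, w, d)` feature map, PerConv to yield one token per convolutional channel, PerFeat one token per spatial position, and PerSamp a single flattened token. The function `tokenize` and the example sizes are hypothetical.

```python
import numpy as np


def tokenize(features, scheme):
    """Turn an (h, w, d) encoder output into a (num_tokens, token_dim) array.

    Assumed readings of the scheme names (not stated in the captions):
      PerConv: d tokens of dimension h*w   (one per convolutional channel)
      PerFeat: h*w tokens of dimension d   (one per spatial position)
      PerSamp: 1 token of dimension h*w*d  (whole sample flattened)
    """
    h, w, d = features.shape
    if scheme == "PerConv":
        return features.reshape(h * w, d).T
    if scheme == "PerFeat":
        return features.reshape(h * w, d)
    if scheme == "PerSamp":
        return features.reshape(1, h * w * d)
    raise ValueError(f"unknown tokenization scheme: {scheme}")


# Hypothetical feature-map size, chosen only for illustration.
features = np.arange(3 * 3 * 8, dtype=np.float32).reshape(3, 3, 8)
shapes = {s: tokenize(features, s).shape for s in ("PerConv", "PerFeat", "PerSamp")}
print(shapes)
```

Under this reading, the choice of tokenization sets how many tokens the MoE router sees per observation and how wide each token is, which is why it is treated as a design axis alongside gating and encoder choices.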