Mixture of Experts in a Mixture of RL settings

Timon Willi, Johan Obando-Ceron, Jakob Foerster, Karolina Dziugaite, Pablo Samuel Castro

TL;DR

Problem: DRL under non-stationarity often suffers from plasticity loss and inefficient parameter usage. Approach: this paper evaluates Mixtures of Experts (MoEs) across multi-task RL (MTRL) and continual RL (CRL) with various MoE architectures and routing strategies, using PPO on multiple Atari-like MinAtar tasks. Key findings: SoftMoE with the Big architecture reduces dormant neurons, improves learning under CRL, and partially benefits MTRL; router learning shows mixed results, with hardcoded routing sometimes outperforming learned routing; environment order significantly affects CRL performance. Significance: provides practical guidelines for integrating MoEs in actor-critic DRL and suggests curricula and multi-agent MoE extensions as fruitful future directions.
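
For a concrete reference point, here is a minimal sketch of a SoftMoE layer of the kind evaluated in the paper, written in PyTorch. The expert MLP shape, slot count, and all sizes are illustrative assumptions rather than the paper's implementation (the experiments use PPO on MinAtar); the sketch only shows the soft dispatch/combine routing that keeps the layer fully differentiable.

```python
import torch
import torch.nn as nn


class SoftMoE(nn.Module):
    """Minimal SoftMoE layer (after Puigcerver et al., 2023): tokens are softly
    dispatched to per-expert slots and expert outputs are softly combined back,
    so routing stays fully differentiable (no top-k expert selection)."""

    def __init__(self, dim, num_experts=4, slots_per_expert=1, hidden=256):
        super().__init__()
        self.num_experts = num_experts
        self.slots_per_expert = slots_per_expert
        # One learnable embedding per (expert, slot) pair, used for routing logits.
        self.phi = nn.Parameter(torch.randn(dim, num_experts * slots_per_expert) * dim ** -0.5)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(dim, hidden), nn.ReLU(), nn.Linear(hidden, dim))
            for _ in range(num_experts)
        )

    def forward(self, x):                      # x: (batch, tokens, dim)
        logits = x @ self.phi                  # (batch, tokens, experts * slots)
        dispatch = logits.softmax(dim=1)       # normalise over tokens -> slot inputs
        combine = logits.softmax(dim=2)        # normalise over slots  -> token outputs
        slots = dispatch.transpose(1, 2) @ x   # (batch, experts * slots, dim)
        slots = slots.view(x.size(0), self.num_experts, self.slots_per_expert, -1)
        out = torch.stack([f(slots[:, i]) for i, f in enumerate(self.experts)], dim=1)
        out = out.flatten(1, 2)                # (batch, experts * slots, dim)
        return combine @ out                   # (batch, tokens, dim)


if __name__ == "__main__":
    layer = SoftMoE(dim=64)
    tokens = torch.randn(8, 10, 64)            # e.g. flattened conv features as tokens
    print(layer(tokens).shape)                 # torch.Size([8, 10, 64])
```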

Abstract

Mixtures of Experts (MoEs) have gained prominence in (self-)supervised learning due to their enhanced inference efficiency, adaptability to distributed training, and modularity. Previous research has illustrated that MoEs can significantly boost Deep Reinforcement Learning (DRL) performance by expanding the network's parameter count while reducing dormant neurons, thereby enhancing the model's learning capacity and ability to deal with non-stationarity. In this work, we shed more light on MoEs' ability to deal with non-stationarity and investigate MoEs in DRL settings with "amplified" non-stationarity via multi-task training, providing further evidence that MoEs improve learning capacity. In contrast to previous work, our multi-task results allow us to better understand the underlying causes of MoEs' beneficial effect in DRL training and the impact of the various MoE components, and they offer insights into how best to incorporate MoEs in actor-critic-based DRL networks. Finally, we also confirm results from previous work.

Paper Structure

This paper contains 21 sections, 1 equation, 34 figures, and 26 tables.

Figures (34)

  • Figure 1: Architectures considered: (a) Baseline architecture; (b) $\textrm{Middle}$, used by Obando-Ceron et al. (2024); (c) $\textrm{Final}$, where an MoE module replaces the final layer; (d) $\textrm{All}$, where all layers are replaced with an MoE module; (e) $\textrm{Big}$, with a single MoE module where an expert comprises the full original network. (A hedged code sketch of the $\textrm{Big}$ variant follows the figure list.)
  • Figure 2: Measuring the impact of MoE architectures with hardcoded routing in MTRL (top) and CRL (bottom). In each legend, the numbers in parentheses indicate the average performance of each approach over all games. $\textrm{Big}$ outperforms all other methods.
  • Figure 3: Measuring the impact of routing with the $\textrm{Big}$ architecture using different routing approaches under the MTRL (top row) and CRL (bottom row) settings. In each legend, the numbers in parentheses indicate the average performance of each approach across all games. SoftMoE and Hardcoded work best in MTRL, and Hardcoded works best in CRL, though SoftMoE still outperforms the baseline.
  • Figure 4: Top: Adding the task ID as an input to the router hurts performance for Big-SoftMoE. Bottom left: Sequential gradient similarity calculated throughout training, where dashed vertical lines represent when tasks switch. Bottom right: Adding gradient information as an input to the router does not improve performance.
  • Figure 5: Top: the ratio of dormant neurons for CRL under different routing approaches using $\textrm{Big}$. The numbers in the legend represent average dormant neuron fractions across all games. MoE variants have fewer dormant neurons than the baseline. Bottom: Regularising the entropy of the router makes the expert selection more uniform. Without regularisation, there is more specialisation. This shows one seed, as different seeds might choose different experts. See the analysis section for more details.
  • ...and 29 more figures
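
To make the $\textrm{Big}$ variant from Figure 1(e) concrete: each expert is a full copy of the baseline network torso, and a single routing layer mixes their outputs. The sketch below uses a simple learned soft combination rather than the full slot-based SoftMoE, and the layer sizes are invented for illustration; it is not the paper's implementation.

```python
import torch
import torch.nn as nn


class BigMoE(nn.Module):
    """Sketch of the 'Big' variant: one MoE module whose experts are each a full
    copy of the baseline torso, mixed by a learned soft routing over experts
    (a simplification of slot-based SoftMoE routing)."""

    def __init__(self, obs_dim, torso_dim=128, num_experts=4):
        super().__init__()
        # Each expert duplicates the entire baseline network torso.
        self.experts = nn.ModuleList(
            nn.Sequential(
                nn.Linear(obs_dim, torso_dim), nn.ReLU(),
                nn.Linear(torso_dim, torso_dim), nn.ReLU(),
            )
            for _ in range(num_experts)
        )
        self.router = nn.Linear(obs_dim, num_experts)

    def forward(self, obs):                                     # obs: (batch, obs_dim)
        weights = self.router(obs).softmax(dim=-1)              # (batch, experts)
        outs = torch.stack([e(obs) for e in self.experts], 1)   # (batch, experts, torso_dim)
        return (weights.unsqueeze(-1) * outs).sum(dim=1)        # (batch, torso_dim)
```

Actor and critic heads would then sit on top of this shared torso; with hardcoded routing, `weights` would instead be a fixed one-hot assignment (e.g. one expert per task) rather than a learned softmax.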