Don't flatten, tokenize! Unlocking the key to SoftMoE's efficacy in deep RL

Ghada Sokar; Johan Obando-Ceron; Aaron Courville; Hugo Larochelle; Pablo Samuel Castro

TL;DR

The paper investigates why SoftMoEs improve online reinforcement learning and finds that tokenizing the convolutional encoder output, rather than using multiple experts, is the dominant factor behind the performance gains. Through a series of controlled experiments, it shows that tokenization preserves the spatial structure of the encoder output and that a single, appropriately scaled expert can match the performance of multiple experts. These findings challenge the default practice of flattening encoder outputs, with implications for pixel-based RL architectures and for how expert capacity is used. The results hold across multiple agents, encoders, and environments, which points to tokenization as a key design principle for scaling RL networks with MoEs and motivates future work on better utilization of expert capacity.

Abstract

The use of deep neural networks in reinforcement learning (RL) often suffers from performance degradation as model size increases. While soft mixtures of experts (SoftMoEs) have recently shown promise in mitigating this issue for online RL, the reasons behind their effectiveness remain largely unknown. In this work we provide an in-depth analysis identifying the key factors driving this performance gain. We discover the surprising result that tokenizing the encoder output, rather than the use of multiple experts, is what is behind the efficacy of SoftMoEs. Indeed, we demonstrate that even with an appropriately scaled single expert, we are able to maintain the performance gains, largely thanks to tokenization.
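The contrast between flattening and tokenizing can be sketched in a few lines. This is a minimal illustration under stated assumptions, not the authors' implementation: it assumes the encoder output has shape (h, w, d), and it assumes that PerConv (named in the Figure 5 caption) yields h*w tokens of dimension d while PerFeat yields d tokens of dimension h*w. The helper names `make_features`, `flatten`, `per_conv_tokens`, and `per_feat_tokens` are hypothetical.

```python
def make_features(h, w, d):
    """Hypothetical encoder output as nested lists of shape (h, w, d)."""
    return [[[float((i * w + j) * d + k) for k in range(d)]
             for j in range(w)]
            for i in range(h)]


def flatten(features):
    """Baseline: collapse (h, w, d) into one vector of length h*w*d."""
    return [v for row in features for cell in row for v in cell]


def per_conv_tokens(features):
    """Assumed PerConv scheme: one d-dimensional token per spatial position."""
    return [list(cell) for row in features for cell in row]


def per_feat_tokens(features):
    """Assumed PerFeat scheme: one (h*w)-dimensional token per feature map."""
    d = len(features[0][0])
    return [[cell[k] for row in features for cell in row] for k in range(d)]


features = make_features(h=2, w=3, d=4)
flat = flatten(features)
conv_tokens = per_conv_tokens(features)
feat_tokens = per_feat_tokens(features)

print(len(flat))                               # 24
print(len(conv_tokens), len(conv_tokens[0]))   # 6 4
print(len(feat_tokens), len(feat_tokens[0]))   # 4 6
```

All three outputs hold the same 24 values. The difference is that the flattened vector goes to the next layer as one undivided input, whereas each token is a separate unit that the following layer can process individually.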

Paper Structure (44 sections, 24 figures, 1 table)

This paper contains 44 sections, 24 figures, 1 table.

Introduction
Online Deep Reinforcement Learning
Mixtures of Experts
Tokenization
Understanding the impact of the SoftMoE components
Experimental setup
Analysis of SoftMoE Components
Processing combined tokens
Expert specialization
Expert width
Network depth
Multiple experts
Don't flatten, tokenize!
Tokenized baseline
Single expert processing all tokens sparsely
...and 29 more sections

Figures (24)

Figure 1: Comparison of IQM (Agarwal et al., 2021) for Rainbow (Hessel et al., 2018) using the Impala architecture (Espeholt et al., 2018) with the penultimate layer scaled, with SoftMoE with varying numbers of experts, and with a single, scaled expert. IQM scores are computed over 200M environment steps across 20 games, with 5 independent runs each; higher is better. Error bars represent 95% stratified bootstrap confidence intervals. A single scaled expert matches the performance of multiple experts.
Figure 2: The various architectures considered in this work. Top: baseline architecture; Middle: SoftMoE architecture; Bottom: Top-$k$ architecture.
Figure 3: Tokenization schemes considered in this work, specified in the bottom row.
Figure 4: Understanding the impact of SoftMoE components. Using Rainbow with the Impala architecture as our base, we investigate key aspects of SoftMoE-4: (i) combining tokens, (ii) expert specialization, and (iii) adjusting architectural dimensions. Reporting IQM (Agarwal et al., 2021), where higher is better. SoftMoE does not appear to suffer from the performance degradation observed in the baseline, even when increasing its width.
Figure 5: Evaluating the impact of tokenizing Rainbow-lite with unscaled (top) and scaled (bottom) architectures, using sum/mean over convolutional features, and exploring both PerConv and PerFeat tokenization schemes. Reporting Median, IQM, Mean, and Optimality Gap scores (Agarwal et al., 2021). Higher is better for all except Optimality Gap. Tokenization with a scaled representation can yield significant performance gains. Per-game results are available in the appendix.
...and 19 more figures
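
The captions above report the interquartile mean (IQM). As a rough sketch of that statistic, the snippet below averages the middle 50% of a list of scores after discarding the lowest and highest quarters. The function name `interquartile_mean` and the sample scores are hypothetical. The paper's aggregation across games and runs, and its stratified bootstrap confidence intervals, are not reproduced here.

```python
def interquartile_mean(scores):
    """Mean of the middle 50% of scores (bottom and top 25% discarded)."""
    if not scores:
        raise ValueError("scores must be non-empty")
    ordered = sorted(scores)
    cut = int(0.25 * len(ordered))  # number of values dropped from each end
    middle = ordered[cut:len(ordered) - cut]
    return sum(middle) / len(middle)


# Hypothetical normalized scores with one outlier run.
scores = [0.0, 1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 100.0]

print(interquartile_mean(scores))   # 3.5
print(sum(scores) / len(scores))    # 15.125
```

The outlier dominates the plain mean but leaves the IQM untouched, which is why the IQM is less sensitive to a single unusually good or bad run.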
