Sparse MoE as the New Dropout: Scaling Dense and Self-Slimmable Transformers

Tianlong Chen; Zhenyu Zhang; Ajay Jaiswal; Shiwei Liu; Zhangyang Wang

TL;DR

This work tackles the high training cost and representational collapse of gigantic transformers by introducing SMoE-Dropout, a plug-and-play training framework that uses a fixed random router and progressively increases the number of active experts to scale model capacity without collapse. By modularizing MLPs into multiple small experts and gradually enriching the active subset, SMoE-Dropout yields a self-slimmable property: performance improves smoothly as more experts are activated during inference or fine-tuning. Extensive experiments across Transformer-XL, BERT, and RoBERTa show superior pre-training efficiency and downstream transfer gains compared to dense and other SMoE baselines, with notable reductions in training time and robust scalability. The approach avoids learning routing policies, mitigates representation collapse, and provides a practical once-for-all capacity control aligned with available resources, suggesting broad applicability to large-scale transformer modeling and beyond.

Abstract

Despite their remarkable achievement, gigantic transformers encounter significant drawbacks, including exorbitant computational and memory footprints during training, as well as severe collapse evidenced by a high degree of parameter redundancy. Sparsely-activated Mixture-of-Experts (SMoEs) have shown promise to mitigate the issue of training efficiency, yet they are prone to (1) redundant experts due to representational collapse; and (2) poor expert scalability for inference and downstream fine-tuning, primarily due to overfitting of the learned routing policy to the number of activated experts during training. As recent research efforts are predominantly focused on improving routing policies to encourage expert specializations, this work focuses on exploring the overlooked scalability bottleneck of SMoEs and leveraging it to effectively scale dense transformers. To this end, we propose a new plug-and-play training framework, SMoE-Dropout, to enable scaling transformers to better accuracy in their full capacity without collapse. Specifically, SMoE-Dropout consists of a randomly initialized and fixed router network to activate experts and gradually increases the activated expert number as training progresses over time. Transformers trained by SMoE-Dropout naturally exhibit a self-slimmable property subject to resource availability, offering smooth and consistent performance boosts with an increase in activated experts during inference or fine-tuning. Our extensive experiments demonstrate the superior performance and substantial computation savings of SMoE-Dropout, compared to dense training baselines with equivalent parameter counts. In particular, our trained BERT outperforms its densely trained counterpart with consistent improvements of {1.03%, 0.78%, 1.09%} on challenging reasoning tasks {ASDiv-A, MAWPS, SVAMP}, respectively.
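The abstract states that the number of activated experts grows as training progresses but does not give the schedule here. The following is a minimal sketch assuming a linear ramp from `k_start` to the full expert count; the function name `active_experts`, its parameters, and the linear shape are illustrative assumptions, not details confirmed by this summary.

```python
def active_experts(step: int, total_steps: int, num_experts: int, k_start: int = 1) -> int:
    """Return how many experts to activate at a given training step.

    Hypothetical linear schedule: starts at k_start and reaches
    num_experts on the final step. The true schedule may differ.
    """
    if total_steps <= 1:
        return num_experts
    # Fraction of training completed, clamped to [0, 1].
    frac = min(max(step / (total_steps - 1), 0.0), 1.0)
    k = k_start + int(frac * (num_experts - k_start))
    # Keep k inside the valid range [k_start, num_experts].
    return min(max(k, k_start), num_experts)


if __name__ == "__main__":
    # Print the schedule at a few checkpoints of a 100-step run with 16 experts.
    for step in (0, 25, 50, 75, 99):
        print(step, active_experts(step, total_steps=100, num_experts=16))
```

Under this assumption, training starts in the sparsest regime and ends with every expert active, which matches the abstract's description of scaling the transformer to its full capacity.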

Paper Structure (34 sections, 1 equation, 5 figures, 6 tables, 2 algorithms)

Introduction
Related Works
Mixture of Experts (MoE).
Dropout and Other Training Techniques for Transformers in NLP.
Methodology
Preliminary
Sparse Mixture of Experts (SMoEs).
Dropout and its variants.
A New Training Pipeline: SMoE-Dropout
Modulization.
Random Routing Policy.
SMoE-Dropout.
Experiment
Implementation Details
Network Architectures and Comparison Baselines.
...and 19 more sections

Figures (5)

Figure 1: Bits-Per-Character ($\downarrow$) on enwik8's test set with a $4$-layer Transformer-XL. SMoE-Dropout demonstrates a "self-slimmable" property, where inference performance improves smoothly as the number of activated parameters increases. Learnable SMoEs tend to overfit certain levels of network capacity. Note that only the gray curve is produced by $5$ different dense models.
Figure 2: Overview of our proposed SMoE-Dropout. Left describes the standard transformer layer, consisting of multi-head attention and multi-layer perceptron (MLP) components. Middle Left shows the process of modulization. It splits the original MLP evenly and constructs a series of experts which are smaller MLPs with a reduced hidden dimension. Middle Right presents the overall procedure of SMoE-Dropout. The random router selects the top-$k$ experts given a token embedding and then reweights the features from activated experts. In the end, a summation is conducted to aggregate all features. Right displays the gradually increased number of chosen experts, along with the training procedure.
Figure 3: Testing performance versus parameter count of {Transformer-XL, BERT, RoBERTa} networks on {enwik8, BookCorpus, BookCorpus} datasets, respectively. A smaller BPC suggests a better model.
Figure 4: Transfer performance versus parameter count of {Transformer-XL, BERT, RoBERTa} networks on downstream {SST-2, CSQA, ASDiv-A, MAWPS, SVAMP} datasets, respectively. Only the fine-tuning of Dense w. Dropout needs multiple pre-trained models with different amounts of network capacity.
Figure 5: Extra studies about SMoE-Dropout. Testing BPC of Transformer-XL is collected on enwik8. ($a$) and ($b$) investigate diverse training mechanisms under different model depths and widths, respectively. ($c$) is the ablation of random routing policies. ($d$) examines the effects of gradually increased $k$. ($e$) studies the appropriate locations to insert SMoE expert layers.
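
The pipeline described in the Figure 2 caption (modulization, fixed random routing, top-$k$ selection, reweighting, summation) can be sketched in plain Python. This is an illustrative sketch under stated assumptions, not the authors' implementation: the helper names (`make_layer`, `forward`, `matvec`), the ReLU activation, the Gaussian initialization, and the choice to reweight with a softmax over all router scores are hypothetical.

```python
import math
import random


def matvec(m, v):
    """Multiply matrix m (list of rows) by vector v."""
    return [sum(a * b for a, b in zip(row, v)) for row in m]


def make_layer(d_model, d_hidden, num_experts, seed=0):
    """Split one dense MLP evenly into experts and attach a fixed random router."""
    assert d_hidden % num_experts == 0, "hidden size must split evenly"
    rng = random.Random(seed)

    def mat(rows, cols):
        return [[rng.gauss(0.0, 0.1) for _ in range(cols)] for _ in range(rows)]

    w_in = mat(d_hidden, d_model)   # dense MLP: first projection
    w_out = mat(d_model, d_hidden)  # dense MLP: second projection
    h = d_hidden // num_experts     # reduced hidden size per expert
    experts = []
    for i in range(num_experts):
        lo, hi = i * h, (i + 1) * h
        experts.append({
            "w_in": w_in[lo:hi],                       # rows of the first projection
            "w_out": [row[lo:hi] for row in w_out],    # matching columns of the second
        })
    # Router is randomly initialized and never updated during training.
    router = mat(num_experts, d_model)
    return {"experts": experts, "router": router}


def forward(layer, x, k):
    """Route token embedding x to its top-k experts and sum the reweighted outputs."""
    scores = matvec(layer["router"], x)
    # Softmax over all experts (assumed reweighting scheme).
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    gates = [e / total for e in exps]
    # Indices of the k highest-scoring experts.
    top = sorted(range(len(scores)), key=lambda i: scores[i], reverse=True)[:k]
    out = [0.0] * len(x)
    for i in top:
        expert = layer["experts"][i]
        hidden = [max(0.0, u) for u in matvec(expert["w_in"], x)]  # ReLU (assumed)
        y = matvec(expert["w_out"], hidden)
        out = [o + gates[i] * yi for o, yi in zip(out, y)]
    return out, top


if __name__ == "__main__":
    layer = make_layer(d_model=8, d_hidden=16, num_experts=4, seed=0)
    token = [0.1 * j for j in range(8)]
    for k in (1, 2, 4):
        _, chosen = forward(layer, token, k)
        print(k, chosen)
```

Because the router is fixed, the experts chosen for a token at a smaller `k` are always a prefix of those chosen at a larger `k`, so increasing `k` only adds capacity on top of what was already active.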
