MomentumSMoE: Integrating Momentum into Sparse Mixture of Experts

Rachel S. Y. Teo, Tan M. Nguyen

TL;DR

The paper theoretically proves and numerically demonstrates that MomentumSMoE is more stable and robust than SMoE, and shows that other advanced momentum-based optimization methods, such as Adam, can be easily incorporated into the MomentumSMoE framework to design new SMoE models with even better performance, almost negligible additional computation cost, and simple implementations.

Abstract

Sparse Mixture of Experts (SMoE) has become the key to unlocking unparalleled scalability in deep learning. SMoE has the potential to exponentially increase parameter count while maintaining the efficiency of the model by only activating a small subset of these parameters for a given sample. However, it has been observed that SMoE suffers from unstable training and has difficulty adapting to new distributions, leading to the model's lack of robustness to data contamination. To overcome these limitations, we first establish a connection between the dynamics of the expert representations in SMoEs and gradient descent on a multi-objective optimization problem. Leveraging our framework, we then integrate momentum into SMoE and propose a new family of SMoEs named MomentumSMoE. We theoretically prove and numerically demonstrate that MomentumSMoE is more stable and robust than SMoE. In particular, we verify the advantages of MomentumSMoE over SMoE on a variety of practical tasks including ImageNet-1K object recognition and WikiText-103 language modeling. We demonstrate the applicability of MomentumSMoE to many types of SMoE models, including those in the Sparse MoE model for vision (V-MoE) and the Generalist Language Model (GLaM). We also show that other advanced momentum-based optimization methods, such as Adam, can be easily incorporated into the MomentumSMoE framework for designing new SMoE models with even better performance, almost negligible additional computation cost, and simple implementations.
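
To make the idea of "integrating momentum into SMoE" concrete, here is a minimal pure-Python sketch. It is not the authors' implementation. It assumes that the SMoE output at each layer plays the role of a descent direction and that momentum is added in the heavy-ball style, with a momentum coefficient `mu` and a step size `gamma` (the symbols $\mu$ and $\gamma$ appear in the Figure 5 caption; the update rule, the sign convention, the toy router, and all function names are assumptions made for illustration).

```python
import math

def smoe(x, experts, router_weights, k=2):
    """Toy SMoE: score experts with a linear router, keep the top-k,
    and mix their outputs with softmax weights over the kept scores."""
    scores = [sum(w_i * x_i for w_i, x_i in zip(w, x)) for w in router_weights]
    top = sorted(range(len(experts)), key=lambda e: scores[e], reverse=True)[:k]
    m = max(scores[e] for e in top)  # subtract the max for numerical stability
    exps = {e: math.exp(scores[e] - m) for e in top}
    z = sum(exps.values())
    out = [0.0] * len(x)
    for e in top:
        y = experts[e](x)
        for i in range(len(x)):
            out[i] += exps[e] / z * y[i]
    return out

def momentum_smoe_stack(x, layers, mu=0.7, gamma=1.0):
    """Assumed heavy-ball update across layers (illustrative only):
        p_l     = SMoE_l(x_l) + mu * p_{l-1}
        x_{l+1} = x_l + gamma * p_l
    With mu = 0 and gamma = 1 this is the usual residual SMoE stack."""
    p = [0.0] * len(x)
    for experts, router_weights in layers:
        u = smoe(x, experts, router_weights)
        p = [u_i + mu * p_i for u_i, p_i in zip(u, p)]
        x = [x_i + gamma * p_i for x_i, p_i in zip(x, p)]
    return x

# Hypothetical toy setup: three scalar-scaling experts on 2-d tokens.
toy_experts = [
    lambda v: [-0.5 * a for a in v],
    lambda v: [0.1 * a for a in v],
    lambda v: [0.0 for _ in v],
]
toy_router = [[1.0, 0.0], [0.0, 1.0], [-1.0, -1.0]]
toy_layers = [(toy_experts, toy_router)] * 2
```

The only extra state compared with a plain residual stack is the momentum vector `p`, which is consistent with the claim of almost negligible additional computation cost.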

Paper Structure

This paper contains 49 sections, 3 theorems, 44 equations, 14 figures, 12 tables.

Key Result

Lemma 1

Given the matrix ${\bm{A}} = $ whose eigenvalues are $\lambda_1({\bm{A}})$ and $\lambda_2({\bm{A}})$, we have $\max\{|\lambda_1({\bm{A}})|,|\lambda_2({\bm{A}})|\}<1$ if and only if $\mu\in (-1,1)$ and $\gamma \sigma(n)\in (0,2+2\mu)$.
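
The stability region in Lemma 1 can be checked numerically. The definition of ${\bm{A}}$ is not shown in the text above, so the sketch below assumes the standard companion matrix of a heavy-ball recurrence, $\begin{pmatrix} 1+\mu-\gamma\sigma(n) & -\mu \\ 1 & 0 \end{pmatrix}$; this choice, and the function names, are assumptions rather than the paper's stated definition. The code computes both eigenvalues in closed form and compares the condition "spectral radius below 1" with the lemma's condition on a grid that avoids the boundary.

```python
import cmath

def spectral_radius(mu, gamma_sigma):
    """Largest eigenvalue modulus of the ASSUMED matrix
    A = [[1 + mu - gamma_sigma, -mu], [1, 0]].
    Its eigenvalues are the roots of lam^2 - trace * lam + det = 0."""
    trace = 1.0 + mu - gamma_sigma
    det = mu
    disc = cmath.sqrt(trace * trace - 4.0 * det)
    lam1 = (trace + disc) / 2.0
    lam2 = (trace - disc) / 2.0
    return max(abs(lam1), abs(lam2))

def lemma_condition(mu, gamma_sigma):
    """Condition stated in Lemma 1: mu in (-1, 1), gamma*sigma in (0, 2 + 2*mu)."""
    return -1.0 < mu < 1.0 and 0.0 < gamma_sigma < 2.0 + 2.0 * mu

def count_disagreements():
    """Count grid points where the two conditions disagree.
    The grid is offset by 0.05 so no point lies on a boundary of the region."""
    bad = 0
    for i in range(30):          # mu from -1.45 to 1.45
        mu = -1.45 + 0.1 * i
        for j in range(60):      # gamma*sigma from -0.95 to 4.95
            gs = -0.95 + 0.1 * j
            if (spectral_radius(mu, gs) < 1.0) != lemma_condition(mu, gs):
                bad += 1
    return bad
```

Under this assumed matrix, `count_disagreements()` finding no mismatches is what the lemma predicts; setting `mu = 0` recovers the interval $(0, 2)$ for $\gamma\sigma(n)$, so a positive $\mu$ widens the admissible range of step sizes.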

Figures (14)

  • Figure 1: Illustration of SMoE (Left) and MomentumSMoE layer (Right). We establish a connection between Multiple-Gradient Descent and SMoE to introduce momentum into the model, leading to better accuracy, enhanced robustness, and faster convergence.
  • Figure 2: Average output norms at layers 1 and 6 of the MoE/SMoE during 80 training epochs on WikiText-103.
  • Figure 3: Average output norm at each layer across 1K train/validation samples of the (S)MoE trained on WikiText-103.
  • Figure 4: Left: WikiText-103 train/validation perplexity (PPL) curves during the first 5 training epochs for MomentumSMoE, AdamSMoE, and SMoE. AdamSMoE has significantly faster convergence compared to SMoE. Right: Training loss/top-1 accuracy (%) of Momentum-Soft MoE vs. Soft MoE baseline on ImageNet-1K across 120 epochs of training. Momentum-Soft MoE has faster convergence and improved accuracy.
  • Figure 5: Left: Proportion of times each expert is chosen, ordered from the largest to the smallest expert output norm, in layers 3 and 5 of SMoE, MomentumSMoE, and AdamSMoE, averaged over the WikiText-103 validation set. Right: Log validation perplexity (PPL) while tuning the hyperparameters $\mu$ and $\gamma$ over 40 training epochs of MomentumSMoE. When tuning $\gamma$, we keep $\mu=0.7$; when tuning $\mu$, we keep $\gamma=1.0$.
  • ...and 9 more figures

Theorems & Definitions (5)

  • Definition 1: Pareto-stationary
  • Lemma 1
  • Proposition 1: Convergence of MomentumSMoE
  • Corollary 1: MomentumSMoE is more stable than SMoE
  • Remark 1