DA-MoE: Towards Dynamic Expert Allocation for Mixture-of-Experts Models

Maryam Akhavan Aghdam; Hongpeng Jin; Yanzhao Wu

TL;DR

This study proposes a dynamic router mechanism that Dynamically Allocates a variable number of experts for Mixture-of-Experts (DA-MoE) models based on an effective token importance measure. Experiments show that DA-MoE consistently outperforms the state-of-the-art Transformer-based MoE model on the popular GLUE benchmark.

Abstract

Transformer-based Mixture-of-Experts (MoE) models have been driving several recent technological advancements in Natural Language Processing (NLP). These MoE models adopt a router mechanism to determine which experts to activate for routing input tokens. However, existing router mechanisms allocate a fixed number of experts to each token, which neglects the varying importance of different input tokens. In this study, we propose a novel dynamic router mechanism that Dynamically Allocates a variable number of experts for Mixture-of-Experts (DA-MoE) models based on an effective token importance measure. First, we show that the Transformer attention mechanism provides a natural and effective way of calculating token importance. Second, we propose a dynamic router mechanism that effectively decides the optimal number of experts (K) and allocates the top-K experts for each input token. Third, comprehensive experiments demonstrate that our DA-MoE approach consistently outperforms the state-of-the-art Transformer-based MoE model on the popular GLUE benchmark.
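The routing idea in the abstract can be illustrated with a minimal sketch. The abstract does not give the formulas, so everything specific here is an assumption made for illustration: token importance is taken as the attention a token receives averaged over heads and query positions, K is scaled linearly against the most important token, and the function names (`token_importance`, `experts_per_token`, `route`) are hypothetical rather than taken from the paper.

```python
import math


def token_importance(attn):
    """Score each token by the attention it receives.

    attn[h][i][j] is the attention weight from query token i to key
    token j in head h. Averaging over heads and query positions is an
    ASSUMED definition, not the paper's confirmed formula.
    """
    num_heads = len(attn)
    seq_len = len(attn[0])
    scores = [0.0] * seq_len
    for head in attn:
        for row in head:
            for j, weight in enumerate(row):
                scores[j] += weight
    return [s / (num_heads * seq_len) for s in scores]


def experts_per_token(importance, max_k):
    """Map importance scores to an expert count K in [1, max_k].

    ASSUMED rule: scale linearly against the most important token and
    round up, so that token receives max_k experts. Assumes at least
    one score is positive, which holds for softmax attention rows.
    """
    top = max(importance)
    return [max(1, min(max_k, math.ceil(max_k * s / top))) for s in importance]


def route(router_logits, ks):
    """Select the K highest-scoring experts for each token."""
    routes = []
    for logits, k in zip(router_logits, ks):
        ranked = sorted(range(len(logits)), key=lambda e: logits[e], reverse=True)
        routes.append(ranked[:k])
    return routes


# Toy input: 1 attention head, 3 tokens, 4 experts.
attn = [[[0.2, 0.6, 0.2],
         [0.1, 0.8, 0.1],
         [0.3, 0.5, 0.2]]]
router_logits = [[0.1, 2.0, -1.0, 0.5],
                 [1.5, 0.2, 0.9, -0.3],
                 [0.0, 0.4, 0.3, 2.2]]

ks = experts_per_token(token_importance(attn), max_k=4)
print(ks)                        # number of experts per token
print(route(router_logits, ks))  # expert indices per token
```

In this toy input the second token receives the most attention, so it is routed to all four experts while the other two tokens each get two. A fixed top-K router would instead give every token the same number of experts regardless of importance.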

Paper Structure

This paper contains 14 sections, 10 equations, 7 figures, 2 tables, 1 algorithm.

Introduction
Related Work
Problem Statement
DA-MoE Overview
DA-MoE Model Architecture
Dynamic Expert Allocation Algorithm
Experimental Analysis
Experimental Setup
Pre-training Settings
Fine-tuning Settings
Pre-training Evaluation
Fine-tuning Evaluation
Token Importance Analysis
Conclusion

Figures (7)

Figure 1: An example input sentence ("The movie was incredibly inspiring.") for performing sentiment analysis
Figure 2: Illustration of a DA-MoE encoder block with a dynamic routing mechanism. DA-MoE introduces a dynamic routing mechanism, allowing each token to be assigned to the top-K experts based on token importance. For example, token one is assigned to four experts, while token two is assigned to three experts.
Figure 3: Training log perplexity scaling comparison between DA-MoE and ST (Switch Transformer) base models.
Figure 4: Attention weights for 12 heads in the last layer for the sentiment analysis task for the sentence ("The movie was incredibly inspiring.")
Figure 5: Attention weights for 12 heads in the last layer for the paraphrasing task for the sentence ("He said he would come. He mentioned he was coming.")
...and 2 more figures
