Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models

Yongxin Guo; Zhenglin Cheng; Xiaoying Tang; Zhaopeng Tu; Tao Lin

TL;DR

This work tackles the sensitivity and inefficiency inherent in Sparse MoE by introducing DynMoE, which combines a tuning-free top-any gating mechanism with an adaptive process that automatically grows or trims experts during training. A novel auxiliary loss encourages diverse yet compact expert representations to maintain efficiency, while test-time safeguards prevent token dropping. Across Vision, Language, and Vision-Language tasks, DynMoE achieves competitive or superior performance with fewer activated parameters and improved throughput, revealing insights such as the need for MoE primarily in bottom layers and the presence of shared experts across layers. Overall, the approach reduces hyperparameter search costs and offers practical, cross-domain gains in efficiency and scalability for large transformer models.

Abstract

The Sparse Mixture of Experts (SMoE) has been widely employed to enhance the efficiency of training and inference for Transformer-based foundational models, yielding promising results. However, the performance of SMoE heavily depends on the choice of hyper-parameters, such as the number of experts and the number of experts to be activated (referred to as top-k). Searching over these hyper-parameter configurations requires training many models, which results in significant computational overhead. As a remedy, we introduce the Dynamic Mixture of Experts (DynMoE) technique. DynMoE incorporates (1) a novel gating method that enables each token to automatically determine the number of experts to activate, and (2) an adaptive process that automatically adjusts the number of experts during training. Extensive numerical results across Vision, Language, and Vision-Language tasks demonstrate that our approach achieves competitive performance compared to GMoE for vision and language tasks, and MoE-LLaVA for vision-language tasks, while maintaining efficiency by activating fewer parameters. Our code is available at https://github.com/LINs-lab/DynMoE.

Paper Structure (41 sections, 7 equations, 24 figures, 24 tables, 1 algorithm)

Introduction
Related Works
Method
Top-Any Gating
Traditional top-$k$ gating and the limitations.
Addressing the limitations of top-$k$ gating by tuning-free top-any gating.
Improving the top-any gating during test-time to prevent token dropping.
Guarding efficiency for top-any gating by auxiliary loss.
Adaptive Training Process
Routing Recording.
Adding Experts when there exist tokens that choose not to activate any experts.
Removing Experts when there exist experts not activated by any token.
Experiments
Experiment Setup
A1: DynMoE Achieves Competitive Performance among Various MoE Settings
...and 26 more sections

Figures (24)

Figure 1: Illustration of performance and efficiency of DynMoE. In the variance panel, we carried out experiments on the GLUE benchmark (Wang et al., 2018), employing BERT-large (Devlin et al., 2019) as the backbone. In the efficiency panel, we follow the MoE-LLaVA (Lin et al., 2024) settings; the $x$-axis represents the number of activated parameters, while the $y$-axis shows the performance on the Visual Question Answering (VQA) task.
Figure 2: Illustration of the top-any gating method. The input tokens pass through the gating weights $\mathbf{W}_{g,e}$ corresponding to each expert $e$ to obtain the gating scores. Scores that surpass the gates $\mathbf{G}_e$ activate the corresponding expert. Finally, the expert outputs are combined to produce the output tokens.
Figure 3: Elaboration on the adaptive training process. We visualize the adaptive training process of DynMoE, including routing recording, expert addition, and expert removal. The green strip connecting a token and an expert indicates a record of that token being routed to that expert. The red arrows at the bottom of the figure show where and when expert addition and removal happen.
Figure 4: Performance of DynMoE on language tasks. We conduct experiments on the GLUE benchmark. The $x$-axis represents MoE settings with varying $K$ and top-$k$ values. The $y$-axis denotes the model's performance. Dashed lines indicate the average performance across different settings, as well as the performance of DynMoE. For all the MoE settings, we tune the learning rate in {2e-5, 3e-5, 5e-5} and report the best results. We also report the number of times each MoE setting attains a top-2 result across all configurations.
Figure 5: Average top-$k$ activated experts of DynMoE on vision-language benchmarks. We record average top-$k$ activated experts for each MoE layer when using StableLM-1.6B as the language model backbone.
...and 19 more figures
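
The adaptive process summarized in the Figure 3 caption and in the outline entries (routing recording, adding experts when some tokens activate no expert, removing experts that no token activates) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function name `adapt_experts`, the per-token set representation of routing records, and the choice to add exactly one expert per adaptation step are all assumptions made for the example.

```python
def adapt_experts(routing_records, num_experts):
    """Decide which experts to keep and whether to add a new one.

    routing_records: one set per token, holding the indices of the experts
    that token activated since the last adaptation step (hypothetical format).
    Returns (kept_expert_indices, add_new_expert).
    """
    if not routing_records:
        # Nothing was recorded, so leave the expert pool unchanged.
        return list(range(num_experts)), False

    # Experts that at least one token activated.
    used = set().union(*routing_records)
    # Remove experts not activated by any token.
    kept = [e for e in range(num_experts) if e in used]
    # Add an expert if some token chose not to activate any expert.
    add_new = any(len(record) == 0 for record in routing_records)
    return kept, add_new


# Toy example: four tokens, four experts.
records = [{0}, {0, 2}, set(), {2}]
kept, add_new = adapt_experts(records, num_experts=4)
new_count = len(kept) + (1 if add_new else 0)
```

In this toy run, experts 1 and 3 are never activated and are dropped, while the third token activates nothing and triggers one addition, so the pool goes from four experts to three. How a newly added expert and its gate are initialized is not covered by this sketch.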

Theorems & Definitions (1)

Remark 3.1: Discussion on not considering the magnitude of scores when averaging the expert outputs.
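
To make the thresholded gating of Figure 2 and the equal-weight averaging named in Remark 3.1 concrete, here is a small pure-Python sketch. Several details are assumptions for illustration only: cosine similarity stands in for the gating score, the thresholds are fixed numbers rather than trained gates, and the `fallback_top1` option is one possible reading of the test-time safeguard against token dropping. The toy experts and all function names are hypothetical.

```python
import math


def cosine(u, v):
    """Cosine similarity between two vectors (assumed score function)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm > 0 else 0.0


def top_any_gate(token, gate_weights, thresholds, fallback_top1=False):
    """Return the experts whose score exceeds their own threshold.

    gate_weights[e] plays the role of W_{g,e}; thresholds[e] plays the role of G_e.
    With fallback_top1=True (an assumed test-time safeguard), a token that
    activates no expert is routed to its best-scoring expert instead.
    """
    scores = [cosine(token, w) for w in gate_weights]
    active = [e for e, (s, g) in enumerate(zip(scores, thresholds)) if s > g]
    if not active and fallback_top1:
        active = [max(range(len(scores)), key=scores.__getitem__)]
    return active, scores


def combine(token, experts, active):
    """Average the activated experts' outputs with equal weights."""
    if not active:
        return [0.0] * len(token)  # no expert fired: this sketch outputs zeros
    outputs = [experts[e](token) for e in active]
    return [sum(column) / len(active) for column in zip(*outputs)]


# Toy setup: three hypothetical experts that scale the token by 1, 2, and 3.
experts = [lambda x, k=k: [k * v for v in x] for k in (1.0, 2.0, 3.0)]
gate_weights = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
thresholds = [0.5, 0.5, 0.5]

token = [1.0, 0.0]
active, scores = top_any_gate(token, gate_weights, thresholds)
output = combine(token, experts, active)
```

Here the token activates experts 0 and 2 (scores 1.0 and about 0.71 against a threshold of 0.5), so the number of active experts is decided per token rather than fixed in advance. The combined output is the plain mean of the two expert outputs, independent of how far each score exceeded its threshold.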
