Dynamic Mixture of Experts: An Auto-Tuning Approach for Efficient Transformer Models
Yongxin Guo, Zhenglin Cheng, Xiaoying Tang, Zhaopeng Tu, Tao Lin
TL;DR
This work tackles the sensitivity and inefficiency inherent in Sparse MoE by introducing DynMoE, which combines a tuning-free top-any gating mechanism with an adaptive process that automatically grows or trims experts during training. A novel auxiliary loss encourages diverse yet compact expert representations to maintain efficiency, while test-time safeguards prevent token dropping. Across Vision, Language, and Vision-Language tasks, DynMoE achieves competitive or superior performance with fewer activated parameters and improved throughput, revealing insights such as the need for MoE primarily in bottom layers and the presence of shared experts across layers. Overall, the approach reduces hyperparameter search costs and offers practical, cross-domain gains in efficiency and scalability for large transformer models.
Abstract
The Sparse Mixture of Experts (SMoE) has been widely employed to enhance the efficiency of training and inference for Transformer-based foundational models, yielding promising results. However, the performance of SMoE heavily depends on the choice of hyper-parameters, such as the number of experts and the number of experts to be activated (referred to as top-k), so finding a good configuration requires training many models across a hyper-parameter search, which incurs significant computational overhead. As a remedy, we introduce the Dynamic Mixture of Experts (DynMoE) technique. DynMoE incorporates (1) a novel gating method that enables each token to automatically determine the number of experts to activate, and (2) an adaptive process that automatically adjusts the number of experts during training. Extensive numerical results across Vision, Language, and Vision-Language tasks demonstrate that our approach achieves performance competitive with GMoE on vision and language tasks and with MoE-LLaVA on vision-language tasks, while maintaining efficiency by activating fewer parameters. Our code is available at https://github.com/LINs-lab/DynMoE.
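To make the idea of a token choosing its own number of experts concrete, the following is a minimal illustrative sketch, not the released DynMoE implementation. It assumes that each expert has a key vector and a threshold, that a token is scored against each key by a sigmoid of cosine similarity, and that a token with no expert above threshold falls back to its single best-scoring expert. The function name `top_any_gate` and its parameters are hypothetical and do not come from the abstract.

```python
import math


def top_any_gate(token, expert_keys, thresholds):
    """Hypothetical top-any gate for one token.

    Returns the indices of activated experts and the per-expert scores.
    The scoring rule (sigmoid of cosine similarity) and the per-expert
    thresholds are assumptions made for illustration only.
    """

    def cosine(a, b):
        dot = sum(x * y for x, y in zip(a, b))
        return dot / (math.hypot(*a) * math.hypot(*b))

    # One score in (0, 1) per expert.
    scores = [1.0 / (1.0 + math.exp(-cosine(token, key))) for key in expert_keys]

    # Activate every expert whose score clears its own threshold,
    # so the number of active experts varies from token to token.
    active = [i for i, (s, th) in enumerate(zip(scores, thresholds)) if s > th]

    # Assumed safeguard: never leave a token without an expert.
    if not active:
        active = [max(range(len(scores)), key=scores.__getitem__)]
    return active, scores


# Two experts with orthogonal keys and a shared threshold of 0.6.
keys = [[1.0, 0.0], [0.0, 1.0]]
thresholds = [0.6, 0.6]
for tok in ([1.0, 0.0], [1.0, 1.0], [-2.0, -1.0]):
    active, _ = top_any_gate(tok, keys, thresholds)
    print(tok, "->", active)
```

In this sketch no top-k value is fixed in advance: a token aligned with one key activates one expert, a token between both keys activates two, and a token far from both still receives one expert through the fallback. How the thresholds would be learned, and how experts are added or removed during training, is outside the scope of this example.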
