CompeteSMoE -- Effective Training of Sparse Mixture of Experts via Competition

Quang Pham, Giang Do, Huy Nguyen, TrungTin Nguyen, Chenghao Liu, Mina Sartipi, Binh T. Nguyen, Savitha Ramasamy, Xiaoli Li, Steven Hoi, Nhat Ho

TL;DR

CompeteSMoE addresses representation collapse in sparse mixture-of-experts by introducing a competition-based routing policy that selects winning experts based on their outputs and distills this policy into an efficient router. The framework comes with a theoretical guarantee showing the competition achieves the same convergence rate as the optimal hindsight estimator, and a practical algorithm that alternates competition-guided routing with standard task optimization under a schedule $\lambda(t)$. Empirically, CompeteSMoE yields robust improvements across two Transformer-based architectures (Switch Transformer and GLaM) on pre-training and finetuning tasks, with strong transfer performance and modest overhead. The work also provides a density-estimation and parameter-estimation analysis yielding parametric-rate guarantees, supporting scalable SMoE training via competition.

Abstract

Sparse mixture of experts (SMoE) offers an appealing solution to scale up model complexity beyond the means of increasing the network's depth or width. However, effective training of SMoE has proven to be challenging due to the representation collapse issue, which causes parameter redundancy and limited representation potential. In this work, we propose a competition mechanism to address this fundamental challenge of representation collapse. By routing inputs only to experts with the highest neural response, we show that, under mild assumptions, competition enjoys the same convergence rate as the optimal estimator. We further propose CompeteSMoE, an effective and efficient algorithm to train large language models by deploying a simple router that predicts the competition outcomes. Consequently, CompeteSMoE enjoys strong performance gains from the competition routing policy while having low computational overhead. Our extensive empirical evaluations on two transformer architectures and a wide range of tasks demonstrate the efficacy, robustness, and scalability of CompeteSMoE compared to state-of-the-art SMoE strategies.
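
The competition idea described in the abstract can be sketched in a few lines of plain Python. This is only an illustration under stated assumptions, not the authors' implementation: the "neural response" is taken to be the L2 norm of each expert's output, the winners are combined with softmax weights, and the router is trained with a mean-squared-error loss against the competition scores. The names `response`, `compete`, and `router_loss` are hypothetical.

```python
import math

def response(output):
    # Neural response of one expert, taken here to be the L2 norm of its
    # output vector (an assumption made for this sketch).
    return math.sqrt(sum(v * v for v in output))

def softmax(xs):
    # Numerically stable softmax over a list of floats.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def compete(x, experts, k):
    # Competition: evaluate every expert, then keep the k strongest responses.
    outputs = [expert(x) for expert in experts]
    scores = [response(o) for o in outputs]
    winners = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    # Combine the winners' outputs with softmax weights (an assumption).
    weights = softmax([scores[i] for i in winners])
    dim = len(outputs[0])
    y = [sum(w * outputs[i][d] for w, i in zip(weights, winners)) for d in range(dim)]
    return y, winners, scores

def router_loss(router_scores, scores):
    # Hypothetical distillation loss: mean squared error between the router's
    # predicted scores and the competition scores.
    return sum((r - s) ** 2 for r, s in zip(router_scores, scores)) / len(scores)

# Three toy experts acting on a 2-dimensional input.
experts = [
    lambda x: [0.1 * v for v in x],
    lambda x: [2.0 * v for v in x],
    lambda x: [-1.0 * v for v in x],
]
y, winners, scores = compete([1.0, 2.0], experts, k=2)
print(winners, y)
```

Note that the competition step evaluates all experts, which is what makes it expensive; the abstract's point is that a cheap router trained to predict the competition outcomes avoids paying this cost on every step.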

Paper Structure (31 sections, 4 theorems, 55 equations, 6 figures, 10 tables)

Key Result

Theorem 4.1

With the MLE defined in equation eq:MLE, the convergence rate of the density estimate $p_{\widehat{G}_n}(Y|X)$ to the true density $p_{G_*}(Y|X)$ is given by: for some universal positive constants $C$ and $c$ depending only on $\Theta$. Here, $h$ is the Hellinger distance defined as $h^2(f_1,f_2):=\frac{1}{2}\int(\sqrt{f_1}-\sqrt{f_2})^2\mathrm{d} \nu$ for any two probability density functions $f_1$ and $f_2$.
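
To make the Hellinger distance in Theorem 4.1 concrete, the following minimal sketch evaluates it for two distributions on a finite set. It assumes the dominating measure $\nu$ is the counting measure, so the integral becomes a sum; the function name `hellinger` is hypothetical and the example is not taken from the paper.

```python
import math

def hellinger(p, q):
    # h^2(p, q) = 1/2 * sum_i (sqrt(p_i) - sqrt(q_i))^2 for discrete densities,
    # i.e. the definition above with nu taken as the counting measure.
    if len(p) != len(q):
        raise ValueError("p and q must have the same length")
    h2 = 0.5 * sum((math.sqrt(a) - math.sqrt(b)) ** 2 for a, b in zip(p, q))
    return math.sqrt(h2)

same = hellinger([0.5, 0.5], [0.5, 0.5])      # identical densities
disjoint = hellinger([1.0, 0.0], [0.0, 1.0])  # disjoint supports
close = hellinger([0.5, 0.5], [0.6, 0.4])     # nearby densities
print(same, disjoint, close)
```

With this normalization the distance lies between 0 (identical densities) and 1 (densities with disjoint supports).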

Figures (6)

  • Figure 1: An illustration of the CompeteSMoE algorithm on three experts.
  • Figure 2: Validation loss of the small transformer model on enwik8 throughout training.
  • Figure 3: Bits per character (BPC) on enwik8 with respect to the number of experts activated.
  • Figure 4: Visualization of the distribution for the output of routers.
  • Figure 5: Visualization of the distribution for the output of routers.
  • ...and 1 more figure

Theorems & Definitions (4)

  • Theorem 4.1: Density Estimation
  • Theorem 4.2: Parameter Estimation
  • Lemma 2.1: Theorem 7.4, van de Geer (2000)
  • Lemma 2.2