Towards Understanding Mixture of Experts in Deep Learning

Zixiang Chen; Yihe Deng; Yue Wu; Quanquan Gu; Yuanzhi Li

TL;DR

The paper offers a formal analysis of why sparse MoE layers with nonlinear experts outperform single models on cluster-structured data, showing that a router can learn cluster-centered routing and specialists can emerge among experts. It demonstrates a negative result for single experts and a positive result for nonlinear MoEs trained via gradient descent with routing perturbations, including a staged exploration and router-learning process. Through synthetic and real-data experiments, the work substantiates the importance of cluster structure and nonlinearities, and reveals that MoEs' benefits depend on task structure. Overall, this study provides foundational insight into MoE mechanisms beyond NTK and suggests practical training strategies to realize their potential in deep learning systems.

Abstract

The Mixture-of-Experts (MoE) layer, a sparsely-activated model controlled by a router, has achieved great success in deep learning. However, the understanding of such architecture remains elusive. In this paper, we formally study how the MoE layer improves the performance of neural network learning and why the mixture model will not collapse into a single model. Our empirical results suggest that the cluster structure of the underlying problem and the non-linearity of the expert are pivotal to the success of MoE. To further understand this, we consider a challenging classification problem with intrinsic cluster structures, which is hard to learn using a single expert. Yet with the MoE layer, by choosing the experts as two-layer nonlinear convolutional neural networks (CNNs), we show that the problem can be learned successfully. Furthermore, our theory shows that the router can learn the cluster-center features, which helps divide the input complex problem into simpler linear classification sub-problems that individual experts can conquer. To our knowledge, this is the first result towards formally understanding the mechanism of the MoE layer for deep learning.

Paper Structure

This paper contains 25 sections, 31 theorems, 143 equations, 8 figures, 11 tables, and 1 algorithm.

Introduction
Related Work
Problem Setting and Preliminaries
Data distribution
Structure of the MoE layer
Training Algorithm
Main Results
Overview of Key Techniques
Experiments
Synthetic-data Experiments
Real-data Experiments
Conclusion and Future Work
Experiment Details
Visualization
Synthetic-data Experiments
...and 10 more sections

Key Result

Theorem 4.1

Suppose $\mathcal{D}_{\alpha} = \mathcal{D}_{\gamma}$ in Definition 3.1. Then any function of the form $F(\mathbf{x}) = \sum_{p = 1}^{P}f(\mathbf{x}^{(p)})$ incurs a large test error: $\mathbb{P}_{(\mathbf{x},y)\sim \mathcal{D}}(yF(\mathbf{x})\leq 0) \geq 1/8$.
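To make the two objects in the theorem concrete, here is a minimal Python sketch of a patch-sum model $F(\mathbf{x}) = \sum_{p} f(\mathbf{x}^{(p)})$ and of an empirical estimate of the test error $\mathbb{P}(yF(\mathbf{x}) \leq 0)$. The choice $f(\mathbf{v}) = \sigma(\langle \mathbf{w}, \mathbf{v} \rangle)$ with $\sigma(z) = z^3$, the weight vector `w`, and the function names are illustrative assumptions, not the paper's construction; the theorem applies to any $f$.

```python
def cubic(z):
    # Cubic activation sigma(z) = z^3 (the nonlinearity named in the Figure 1 caption).
    return z ** 3

def patch_sum_model(patches, w):
    # F(x) = sum_p f(x^(p)), with the hypothetical choice f(v) = cubic(<w, v>).
    # `patches` is a list of P patch vectors; `w` is one shared weight vector.
    return sum(cubic(sum(wi * vi for wi, vi in zip(w, patch))) for patch in patches)

def empirical_test_error(model, samples):
    # Fraction of labelled samples (x, y), y in {-1, +1}, with y * F(x) <= 0.
    # A zero margin counts as an error, matching the "<= 0" in the theorem.
    errors = sum(1 for x, y in samples if y * model(x) <= 0)
    return errors / len(samples)
```

Because $F$ sums the same $f$ over all patches, its value does not depend on the order of the patches. Any model of this form therefore treats every patch position identically.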

Figures (8)

Figure 1: Visualization of the training of MoE with nonlinear experts and with linear experts. Different colors denote the router's dispatch to different experts. The lines denote the decision boundary of the MoE model. The data points are projected to 2D via t-SNE (van der Maaten and Hinton, 2008). The MoE architecture follows Section 3 (Problem Setting and Preliminaries), where nonlinear experts use the activation function $\sigma(z)=z^3$. For this visualization, we set the number of experts to $M=4$ and the number of clusters to $K=4$. We generate $n=1{,}600$ data points from the distribution described in Section 3 with $\alpha \in (0.5,2)$, $\beta \in (1,2)$, $\gamma \in (1,2)$, and $\sigma_p = 1$. More details of the visualization are discussed in the Experiment Details appendix.
Figure 2: Illustration of an MoE layer. For each input $\mathbf{x}$, the router will only select one expert to perform computations. The choice is based on the output of the gating network (dotted line). The expert layer returns the output of the selected expert (gray box) multiplied by the route gate value (softmax of the gating function output).
Figure 3: Illustration of router dispatch entropy. We show how the dispatch entropy of the MoE changes during training on the synthetic data. MoE (linear)-1 and MoE (nonlinear)-1 refer to Setting 1 in the first synthetic-data results table. MoE (linear)-2 and MoE (nonlinear)-2 refer to Setting 2 in the same table.
Figure 4: Mixture of nonlinear experts. Growth of inner product between expert/router weight and center/feature vector.
Figure 5: Mixture of linear experts. Growth of inner product between expert/router weight and center/feature vector.
...and 3 more figures
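The routing rule in the Figure 2 caption (select a single expert, then scale its output by the softmax gate value) can be sketched as follows. This is a minimal illustration under stated assumptions: the linear gating network `router_weights`, the scalar-valued experts, and all function names are hypothetical. The `dispatch_entropy` helper is one simple way to quantify how spread out the router's choices are, namely the Shannon entropy of the per-expert dispatch frequencies. The paper's exact definition of dispatch entropy in Figure 3 may differ.

```python
import math

def softmax(logits):
    # Numerically stable softmax over a list of floats.
    m = max(logits)
    exps = [math.exp(v - m) for v in logits]
    total = sum(exps)
    return [e / total for e in exps]

def moe_forward(x, router_weights, experts):
    # Assumption: a linear gating network, one weight row per expert.
    logits = [sum(w * xi for w, xi in zip(row, x)) for row in router_weights]
    gates = softmax(logits)
    # Top-1 routing: only the expert with the largest gating output is evaluated.
    chosen = max(range(len(experts)), key=lambda m: logits[m])
    # The selected expert's output is multiplied by its gate value.
    return gates[chosen] * experts[chosen](x), chosen

def dispatch_entropy(choices, num_experts):
    # Shannon entropy (in nats) of the empirical distribution of expert choices.
    counts = [0] * num_experts
    for m in choices:
        counts[m] += 1
    n = len(choices)
    return -sum((c / n) * math.log(c / n) for c in counts if c > 0)
```

Under this definition the entropy is 0 when every input goes to the same expert, and $\log M$ when inputs are spread evenly over $M$ experts.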

Theorems & Definitions (34)

Definition 3.1
Theorem 4.1: Single expert performs poorly
Theorem 4.2: Nonlinear MoE performs well
Lemma 5.1
Lemma 5.2
Lemma C.1: Extension of Lemma \ref{lm:Msmoothly}
Remark C.2
Lemma C.3
Lemma C.4
Lemma D.1
...and 24 more
