A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts

Mohammed Nowaz Rabbani Chowdhury; Meng Wang; Kaoutar El Maghraoui; Naigang Wang; Pin-Yu Chen; Christopher Carothers

TL;DR

This paper theoretically proves that prioritizing the pruning of experts with a smaller change in the router's $\ell_2$ norm from the pretrained model guarantees the preservation of test accuracy, while significantly reducing the model size and the computational requirements.

Abstract

The sparsely gated mixture-of-experts (MoE) architecture sends different inputs to different subnetworks, i.e., experts, through trainable routers. MoE significantly reduces the training computation for large models, but its deployment can still be memory- or computation-intensive for some downstream tasks. Model pruning is a popular approach to reducing inference computation, but its application to MoE architectures is largely unexplored. To the best of our knowledge, this paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models. We theoretically prove that prioritizing the pruning of experts with a smaller change in the router's $\ell_2$ norm from the pretrained model guarantees the preservation of test accuracy, while significantly reducing the model size and the computational requirements. Although our theoretical analysis is centered on binary classification tasks on a simplified MoE architecture, our expert pruning method is verified on large vision MoE models such as V-MoE and $\text{E}^3$-MoE fine-tuned on benchmark datasets such as CIFAR-10, CIFAR-100, and ImageNet.
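The selection rule stated in the abstract can be illustrated with a minimal sketch. It assumes each expert's router is a single weight vector and reads "change in the router's $\ell_2$ norm" as the signed difference between the fine-tuned and pretrained norms; the function name `select_experts_to_prune` and the array layout are hypothetical and not taken from the paper, and other readings (absolute difference, or the norm of the weight difference) could be swapped in on the marked line.

```python
import numpy as np

def select_experts_to_prune(pretrained_routers, finetuned_routers, num_to_prune):
    """Rank experts by router-norm change and return the ones to prune first.

    Both inputs have shape (num_experts, d); row s is the router vector of
    expert s (an assumed layout). Returns (sorted indices to prune, changes).
    """
    pretrained = np.asarray(pretrained_routers, dtype=float)
    finetuned = np.asarray(finetuned_routers, dtype=float)

    # Assumed reading of the criterion: signed change of the l2 norm of each
    # router vector between the pretrained and the fine-tuned model.
    change = np.linalg.norm(finetuned, axis=1) - np.linalg.norm(pretrained, axis=1)

    # Experts with the smallest change are prioritized for pruning.
    order = np.argsort(change, kind="stable")
    to_prune = sorted(order[:num_to_prune].tolist())
    return to_prune, change


if __name__ == "__main__":
    # Toy routers for three experts in two dimensions (made-up numbers).
    pre = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]
    fin = [[3.0, 0.0], [0.0, 1.1], [1.0, 1.0]]
    pruned, change = select_experts_to_prune(pre, fin, num_to_prune=2)
    print("norm changes:", change)
    print("experts to prune:", pruned)
```

In this toy case only the first expert's router norm grows substantially during fine-tuning, so it is the one retained.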

Paper Structure (27 sections, 14 theorems, 42 equations, 10 figures, 1 algorithm)

Introduction
Related Works
Method
The Mixture-of-Experts Architecture
Expert Pruning Method in MoE
Theoretical Guarantees of the Expert Pruning Method
Key Theoretical Findings
The Analysis Setup
Main Generalization Results of Expert Pruning
Experimental Results
Experiments on Synthetic Data
Experiments on State-of-the-art Vision MoE Models
Results
Conclusion
More Details on V-MoE and $\text{E}^3$-MoE
...and 12 more sections

Key Result

Lemma 4.1

Suppose the expert learning rate $\eta_e$, the router learning rate $\eta_r$, the batch size $B$, and the number of iterations $T$ satisfy … For any expert $s\in S_1$ such that $p_1^{(s,0)}=\Omega(1)$, we have (i) $p_1^{(s,T)}=1$, (ii) for every $(x,+1)\sim\mathcal{D}$, $G_j^{(s,T)}(x)>1/2$ if $x^{(j)}=o_1$, (iii) $\langle w_r^{(s,T)},o_1\rangle=\Omega(l\sqrt{d\log l})$, for a constant fraction …

Figures (10)

Figure 1: Generalization performance of the pruned V-MoE on CIFAR-10 with post-pruning fine-tuning. 'pruned 2 exp/enc' denotes pruning two experts from each MoE encoder.
Figure 2: Left: token-choice routing, where each token selects experts based on the routing values over the experts. Right: expert-choice routing, where each expert selects tokens based on the routing values over the tokens. In both cases, the experts with a smaller change in the norm of the router's weights are pruned (Expert 2 in the figure). The output tokens for the pruned experts are set to zero (Token 1 in the figure). On the left, the routers of the pruned experts are retained to calculate the gating values. On the right, the routers of the pruned experts are also pruned.
Figure 3: (a) The norm of the post-training router weights, (b)(c) Projections of router weights to different directions, (d)(e) Projections of neuron weights to different directions (larger pixel intensity represents a larger component of the router weights).
Figure 4: Generalization performance of the pruned V-MoE models: (a) Comparison with random pruning on CIFAR-10, (b) On CIFAR-10 w/o post-pruning fine-tuning, (c) On CIFAR-100 w/o post-pruning fine-tuning, (d) On CIFAR-100 with post-pruning fine-tuning, (e) On ImageNet w/o post-pruning fine-tuning, (f) On ImageNet with post-pruning fine-tuning
Figure 5: Comparison between different expert pruning methods: (a) vs. importance score on CIFAR-10, (b) vs. absolute magnitude on CIFAR-10, (c) vs. average change-in-neurons-magnitude on CIFAR-10, (d) vs. importance score on ImageNet, (e) vs. absolute magnitude on ImageNet, (f) vs. average change-in-neurons-magnitude on ImageNet
...and 5 more figures
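The token-choice case described in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's implementation: gating values are taken to be a softmax over all experts followed by top-k selection, experts are plain callables that preserve the token dimension, and the name `token_choice_forward` and its parameters are hypothetical.

```python
import numpy as np

def token_choice_forward(tokens, router_weights, experts, pruned, top_k=1):
    """Token-choice MoE layer where pruned experts contribute zero output.

    tokens: (n, d) array; router_weights: (num_experts, d) array;
    experts: list of callables mapping a (d,) vector to a (d,) vector;
    pruned: set of expert indices whose outputs are set to zero.
    """
    tokens = np.asarray(tokens, dtype=float)
    routers = np.asarray(router_weights, dtype=float)

    # Routers of pruned experts are retained, so gating uses all experts.
    logits = tokens @ routers.T
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    gates = np.exp(logits)
    gates = gates / gates.sum(axis=1, keepdims=True)

    # Each token picks its top_k experts by gating value (assumed routing rule).
    top = np.argsort(-gates, axis=1)[:, :top_k]

    out = np.zeros_like(tokens)
    for i, chosen in enumerate(top):
        for s in chosen:
            s = int(s)
            if s in pruned:
                continue  # output for a pruned expert is zero
            out[i] += gates[i, s] * np.asarray(experts[s](tokens[i]), dtype=float)
    return out


if __name__ == "__main__":
    toks = [[1.0, 0.0], [0.0, 1.0]]
    routers = [[5.0, 0.0], [0.0, 5.0]]
    experts = [lambda x: 2.0 * x, lambda x: 3.0 * x]
    print(token_choice_forward(toks, routers, experts, pruned={1}))
```

In the toy call, the second token is routed to the pruned expert, so its output row is all zeros, while the first token is still processed by the retained expert.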

Theorems & Definitions (26)

Lemma 4.1: Important experts become more specialized
Lemma 4.2: Unimportant experts stay unimportant
Theorem 4.3: Generalization of pruned model with no post-pruning fine-tuning
Lemma 4.4: Post-pruning fine-tuning promotes experts to learn task-specific features
Theorem 4.5: Generalization of pruned model with post-pruning fine-tuning
Definition 3.1: Important and Unimportant experts in pre-trained model
Lemma 4.1: Full version of Lemma 4.1
proof
Lemma 5.1
proof
...and 16 more
