A Provably Effective Method for Pruning Experts in Fine-tuned Sparse Mixture-of-Experts

Mohammed Nowaz Rabbani Chowdhury, Meng Wang, Kaoutar El Maghraoui, Naigang Wang, Pin-Yu Chen, Christopher Carothers

TL;DR

This paper theoretically proves that prioritizing the pruning of experts whose router's $\ell_2$ norm changes less from the pretrained model guarantees the preservation of test accuracy, while significantly reducing the model size and the computational requirements.

Abstract

The sparsely gated mixture of experts (MoE) architecture sends different inputs to different subnetworks, i.e., experts, through trainable routers. MoE significantly reduces the training computation for large models, but its deployment can still be memory- or computation-intensive for some downstream tasks. Model pruning is a popular approach to reducing inference computation, but its application to MoE architectures is largely unexplored. To the best of our knowledge, this paper provides the first provably efficient technique for pruning experts in fine-tuned MoE models. We theoretically prove that prioritizing the pruning of experts whose router's $\ell_2$ norm changes less from the pretrained model guarantees the preservation of test accuracy, while significantly reducing the model size and the computational requirements. Although our theoretical analysis is centered on binary classification tasks on a simplified MoE architecture, our expert pruning method is verified on large vision MoE models such as VMoE and E3MoE fine-tuned on benchmark datasets such as CIFAR10, CIFAR100, and ImageNet.
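
The selection rule stated in the abstract can be illustrated with a short NumPy sketch. It is only a sketch under assumptions: the function name `select_experts_to_prune`, the `(num_experts, d)` weight layout, and the reading of "change of the router's $\ell_2$ norm" as the difference between the fine-tuned and pretrained norms are all hypothetical choices made here, not details taken from the paper.

```python
import numpy as np


def select_experts_to_prune(pretrained_routers, finetuned_routers, num_to_prune):
    """Pick the experts whose router l2 norm changed least during fine-tuning.

    Both arrays are assumed to have shape (num_experts, d), i.e. one router
    weight vector per expert (a hypothetical layout for this sketch).
    Returns (sorted indices of experts to prune, per-expert norm change).
    """
    pre = np.asarray(pretrained_routers, dtype=float)
    fin = np.asarray(finetuned_routers, dtype=float)
    if pre.shape != fin.shape:
        raise ValueError("pretrained and fine-tuned routers must have the same shape")
    if not 0 <= num_to_prune <= pre.shape[0]:
        raise ValueError("num_to_prune must be between 0 and the number of experts")

    # Assumed reading: change = ||w_finetuned|| - ||w_pretrained|| per expert.
    norm_change = np.linalg.norm(fin, axis=1) - np.linalg.norm(pre, axis=1)

    # Experts with the smallest change are pruned first.
    order = np.argsort(norm_change, kind="stable")
    return np.sort(order[:num_to_prune]), norm_change


# Toy example with four experts and 2-dimensional routers.
pretrained = np.array([[1.0, 0.0], [1.0, 0.0], [1.0, 0.0], [1.0, 0.0]])
finetuned = np.array([[3.0, 0.0], [1.0, 0.1], [0.0, 2.0], [1.0, 0.0]])
to_prune, change = select_experts_to_prune(pretrained, finetuned, num_to_prune=2)
```

If the intended quantity is instead the norm of the weight difference, the only line that changes is the computation of `norm_change`.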

Paper Structure

This paper contains 27 sections, 14 theorems, 42 equations, 10 figures, 1 algorithm.

Key Result

Lemma 4.1

Suppose the expert learning rate $\eta_e$, the router learning rate $\eta_r$, the batch size $B$, and the number of iterations $T$ satisfy ... For any expert $s\in S_1$ such that $p_1^{(s,0)}=\Omega(1)$, we have (i) $p_1^{(s,T)}=1$, (ii) for every $(x,+1)\sim\mathcal{D}$, $G_j^{(s,T)}(x)>1/2$ if $x^{(j)}=o_1$, (iii) $\langle w_r^{(s,T)},o_1\rangle=\Omega(l\sqrt{d\log l})$, for a constant fraction ...

Figures (10)

  • Figure 1: Generalization performance of the pruned VMoE on CIFAR-10 with post-pruning fine-tuning. 'pruned 2 exp/enc' denotes pruning two experts from each MoE encoder.
  • Figure 2: Left: Token-choice routing: each token selects experts based on the routing values over the experts. Right: Expert-choice routing: each expert selects tokens based on the routing values over the tokens. In both cases, the experts with a smaller norm change of the router's weights are pruned (Expert 2). The output tokens for the pruned experts are set to zero (Token 1 in the figure). On the left, the routers of the pruned experts are retained to calculate the gating value. On the right, the routers of the pruned experts are also pruned.
  • Figure 3: (a) The norm of the post-training router weights, (b)(c) Projections of router weights to different directions, (d)(e) Projections of neuron weights to different directions (larger pixel intensity represents a larger component of the router weights).
  • Figure 4: Generalization performance of the pruned V-MoE models: (a) Comparison with random pruning on CIFAR-10, (b) On CIFAR-10 w/o post-pruning fine-tuning, (c) On CIFAR-100 w/o post-pruning fine-tuning, (d) On CIFAR-100 with post-pruning fine-tuning, (e) On ImageNet w/o post-pruning fine-tuning, (f) On ImageNet with post-pruning fine-tuning.
  • Figure 5: Comparison between different expert pruning methods: (a) vs. importance score on CIFAR-10, (b) vs. absolute magnitude on CIFAR-10, (c) vs. average change-in-neurons-magnitude on CIFAR-10, (d) vs. importance score on ImageNet, (e) vs. absolute magnitude on ImageNet, (f) vs. average change-in-neurons-magnitude on ImageNet.
  • ...and 5 more figures
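
The token-choice case described in the Figure 2 caption (routers of pruned experts retained for gating, outputs of pruned experts set to zero) might look roughly like the following NumPy sketch. The function name `token_choice_forward`, the top-k selection, the softmax over the selected logits, and the assumption that each expert maps a token to a vector of the same dimension are all illustrative choices, not details confirmed by the source.

```python
import numpy as np


def token_choice_forward(tokens, router_weights, experts, pruned, k=1):
    """Token-choice MoE layer in which pruned experts contribute zero output.

    tokens:         (n, d) array of input tokens.
    router_weights: (E, d) array, one router vector per expert. Routers of
                    pruned experts are kept so gating values are unchanged.
    experts:        list of E callables mapping (m, d) -> (m, d) (assumed).
    pruned:         iterable of expert indices that have been removed.
    """
    tokens = np.asarray(tokens, dtype=float)
    logits = tokens @ np.asarray(router_weights, dtype=float).T  # (n, E)

    # Each token picks its k highest-scoring experts, pruned or not.
    topk = np.argsort(-logits, axis=1, kind="stable")[:, :k]     # (n, k)

    # Gating values: softmax over the selected logits (an assumed choice).
    selected = np.take_along_axis(logits, topk, axis=1)
    selected = selected - selected.max(axis=1, keepdims=True)
    gates = np.exp(selected)
    gates /= gates.sum(axis=1, keepdims=True)

    out = np.zeros_like(tokens)
    pruned = set(pruned)
    for e, expert in enumerate(experts):
        if e in pruned:
            continue  # output for tokens routed to a pruned expert stays zero
        rows, slots = np.nonzero(topk == e)
        if rows.size:
            out[rows] += gates[rows, slots][:, None] * expert(tokens[rows])
    return out


# Toy example: two tokens, two experts, expert 1 pruned.
toy_tokens = np.eye(2)
toy_routers = np.eye(2)
toy_experts = [lambda x: 2.0 * x, lambda x: 3.0 * x]
pruned_out = token_choice_forward(toy_tokens, toy_routers, toy_experts, pruned={1})
dense_out = token_choice_forward(toy_tokens, toy_routers, toy_experts, pruned=())
```

In the toy example the second token is routed to the pruned expert, so its output row is zero in `pruned_out` while the first token's output is unchanged relative to `dense_out`.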

Theorems & Definitions (26)

  • Lemma 4.1: Important experts become more specialized
  • Lemma 4.2: Unimportant experts stay unimportant
  • Theorem 4.3: Generalization of pruned model with no post-pruning fine-tuning
  • Lemma 4.4: Post-pruning fine-tuning promotes experts to learn task-specific features
  • Theorem 4.5: Generalization of pruned model with post-pruning fine-tuning
  • Definition 3.1: Important and Unimportant experts in pre-trained model
  • Lemma 4.1: Full version of Lemma 4.1
  • proof
  • Lemma 5.1
  • proof
  • ...and 16 more