Every Expert Matters: Towards Effective Knowledge Distillation for Mixture-of-Experts Language Models

Gyeongman Kim, Gyouk Chu, Eunho Yang

TL;DR

Mixture-of-Experts models offer scalable capacity but incur high memory costs due to dormant parameters, motivating targeted compression. The authors propose two MoE-specific knowledge distillation methods—Knowledge Augmentation (KA) and Student-Aware Router (SAR)—to extract knowledge from all experts, including those not activated during standard routing. Across five instruction-following datasets, KA and SAR consistently outperform traditional KD baselines when distilling from MoE teachers, while simply activating all experts is not universally beneficial. These results demonstrate the importance of tailoring distillation to the MoE architecture, enabling effective, resource-efficient deployment of large MoE language models.

Abstract

With the emergence of Mixture-of-Experts (MoE), the efficient scaling of model size has accelerated the development of large language models in recent years. However, their high memory requirements prevent their use in resource-constrained environments. While knowledge distillation (KD) has been a proven method for model compression, its application to MoE teacher models remains underexplored. Through our investigation, we discover that non-activated experts in MoE models possess valuable knowledge that benefits student models. We further demonstrate that existing KD methods are not optimal for compressing MoE models, as they fail to leverage this knowledge effectively. To address this, we propose two intuitive MoE-specific KD methods for the first time: Knowledge Augmentation (KA) and Student-Aware Router (SAR), both designed to effectively extract knowledge from all experts. Specifically, KA augments knowledge by sampling experts multiple times, while SAR uses all experts and adjusts the expert weights through router training to provide optimal knowledge. Extensive experiments show that our methods outperform conventional KD methods, demonstrating their effectiveness for MoE teacher models.
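The abstract describes KA only as "sampling experts multiple times", so the following is a minimal sketch of that idea under stated assumptions, not the authors' implementation. It assumes experts are drawn without replacement in proportion to their gate probabilities and that the weights of the drawn experts are renormalized to sum to one. All names (`sample_experts`, `augmented_expert_sets`, `num_samples`) are hypothetical.

```python
import math
import random


def softmax(logits):
    """Numerically stable softmax over a list of router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def sample_experts(gate_probs, k, rng):
    """Draw k distinct expert indices, each draw proportional to gate probability.

    Assumption: sampling without replacement; the source does not spell this out.
    """
    remaining = list(range(len(gate_probs)))
    chosen = []
    for _ in range(k):
        weights = [gate_probs[i] for i in remaining]
        pick = rng.choices(remaining, weights=weights, k=1)[0]
        chosen.append(pick)
        remaining.remove(pick)
    return chosen


def renormalize(gate_probs, chosen):
    """Rescale the chosen experts' gate probabilities so they sum to 1."""
    total = sum(gate_probs[i] for i in chosen)
    return {i: gate_probs[i] / total for i in chosen}


def augmented_expert_sets(router_logits, k, num_samples, seed=0):
    """Return num_samples expert-weight dicts for one token (hypothetical helper)."""
    rng = random.Random(seed)
    probs = softmax(router_logits)
    return [
        renormalize(probs, sample_experts(probs, k, rng))
        for _ in range(num_samples)
    ]


if __name__ == "__main__":
    logits = [2.0, 1.0, 0.5, 0.1, -0.3, -1.0, -1.5, -2.0]
    for weights in augmented_expert_sets(logits, k=4, num_samples=3):
        print(sorted(weights.items()))
```

In a full pipeline, each sampled expert set would presumably yield one teacher output distribution for the student to match. How those outputs are combined into the distillation loss is not stated in this section and is left out of the sketch.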

Paper Structure

This paper contains 24 sections, 7 equations, 5 figures, 3 tables, 2 algorithms.

Figures (5)

  • Figure 1: Sum of the gate probabilities for activated and non-activated experts per layer during distillation. The $(k/N)$ after each model name indicates that $k$ out of $N$ experts are activated. Across most layers of all Llama-MoE models, the gate probabilities of the activated experts sum to less than 50%.
  • Figure 2: Performance of the MoE teacher model and the student model after distillation with varying numbers of utilized experts $k$ (originally 4). As $k$ increases, the effectiveness of distillation improves, leading to better student performance. However, the performance of the teacher model itself does not necessarily improve with a larger $k$.
  • Figure 3: An overview of our proposed KD methods specifically designed for MoE teachers. In knowledge augmentation, we either select the top $N-1$ experts or sample $N-1$ experts based on the gate probability. We repeat this $M$ times to augment the knowledge. In the student-aware router, we train the router network with student feedback before distillation. This enables the router to determine the optimal weights, thereby facilitating the student's acquisition of knowledge from all experts.
  • Figure 4: Average performance of KA for different numbers of samples, $M$, across all test data. $\lambda$ is fixed at 0.05. The best-performing $M$ differs for each MoE teacher. If $M$ is too large, all models exhibit reduced performance.
  • Figure 5: KL divergence of gate probabilities between the original router and the router trained with the SAR method. The value is averaged over all tokens in the training data. KL divergence consistently increases with layer depth.
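The quantities plotted in Figures 1 and 5 can be illustrated with a short sketch. It assumes standard top-$k$ routing over softmax gate probabilities and, for Figure 5, a KL divergence taken from the original router's distribution to the trained router's distribution; the captions do not state the direction, so that choice is an assumption. The function names are hypothetical.

```python
import math


def gate_mass_split(gate_probs, k):
    """Return (activated_mass, non_activated_mass) for one token under top-k routing."""
    ranked = sorted(gate_probs, reverse=True)
    return sum(ranked[:k]), sum(ranked[k:])


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) between two gate distributions over the same experts.

    eps guards against log(0); terms with p_i == 0 contribute nothing.
    """
    return sum(
        pi * math.log((pi + eps) / (qi + eps))
        for pi, qi in zip(p, q)
        if pi > 0
    )


def mean_kl(original_probs, trained_probs):
    """Average KL over tokens; each argument is a list of per-token distributions."""
    values = [kl_divergence(p, q) for p, q in zip(original_probs, trained_probs)]
    return sum(values) / len(values)


if __name__ == "__main__":
    # Toy example: a flat router over 16 experts with 4 activated (a 4/16 setting).
    flat = [1.0 / 16] * 16
    activated, dormant = gate_mass_split(flat, k=4)
    print(f"activated mass: {activated:.2f}, non-activated mass: {dormant:.2f}")

    # Toy example: a router whose distribution changed after training.
    before = [[0.5, 0.5], [0.7, 0.3]]
    after = [[0.9, 0.1], [0.7, 0.3]]
    print(f"mean KL: {mean_kl(before, after):.4f}")
```

The flat-router case shows how the activated experts can hold well under half of the gate probability mass when $k$ is small relative to $N$, which is the situation Figure 1 reports for most layers.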