
MoDE: A Mixture-of-Experts Model with Mutual Distillation among the Experts

Zhitian Xie, Yinger Zhang, Chenyi Zhuang, Qitao Shi, Zhining Liu, Jinjie Gu, Guannan Zhang

TL;DR

MoDE introduces a moderate mutual distillation mechanism among the experts of a Mixture-of-Experts (MoE) architecture to address the narrow-vision problem, in which each expert learns from only a limited subset of samples. By enforcing a distillation loss across expert representations with a tunable strength $\alpha$, MoDE encourages experts to incorporate complementary features learned by others while preserving specialization. Across tabular, NLP, and computer-vision tasks, MoDE consistently improves MoE performance and gate routing accuracy. Expert probing shows that moderate distillation improves each expert's performance on its designated sub-task without the experts collapsing into identical networks. The results position MoDE as a universal, robust enhancement to MoE that offers insight into feature utilization and a practical path to better generalization across diverse domains.

Abstract

The application of mixture-of-experts (MoE) is gaining popularity due to its ability to improve model performance. In an MoE structure, the gate layer plays a significant role in distinguishing and routing input features to different experts, which enables each expert to specialize in processing its corresponding sub-task. However, the gate's routing mechanism also gives rise to narrow vision: each individual expert learns its allocated sub-task from only the samples routed to it, which in turn limits the MoE's ability to further improve its generalization. To address this, we propose a method called Mixture-of-Distilled-Expert (MoDE), which applies moderate mutual distillation among experts so that each expert picks up more features learned by other experts and gains a more accurate perception of its originally allocated sub-task. We conduct extensive experiments on tabular, NLP and CV datasets, which show MoDE's effectiveness, universality and robustness. Furthermore, we develop a parallel study by constructing "expert probing" to show experimentally why MoDE works: moderate knowledge distillation can improve each individual expert's test performance on its assigned task, leading to an improvement in the MoE's overall performance.
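
To make the idea of a distillation term with strength $\alpha$ concrete, the following is a minimal sketch in plain Python. It is an illustration under stated assumptions, not the authors' implementation: the distillation term is assumed here to be the mean squared difference between the outputs of every pair of experts, and the function names (`moe_output`, `mutual_distillation_loss`, `mode_loss`) are hypothetical. The paper's exact loss may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of gate logits."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def moe_output(gate_logits, expert_outputs):
    """Gate-weighted sum of the expert outputs for a single sample."""
    weights = softmax(gate_logits)
    dim = len(expert_outputs[0])
    return [
        sum(w * out[d] for w, out in zip(weights, expert_outputs))
        for d in range(dim)
    ]


def mutual_distillation_loss(expert_outputs):
    """Assumed form: mean squared difference, averaged over expert pairs."""
    k = len(expert_outputs)
    pairs = [(i, j) for i in range(k) for j in range(i + 1, k)]
    if not pairs:
        return 0.0  # a single expert has nothing to distill from
    total = 0.0
    for i, j in pairs:
        diffs = [a - b for a, b in zip(expert_outputs[i], expert_outputs[j])]
        total += sum(d * d for d in diffs) / len(diffs)
    return total / len(pairs)


def mode_loss(task_loss, expert_outputs, alpha):
    """Task loss plus the distillation term scaled by the strength alpha."""
    return task_loss + alpha * mutual_distillation_loss(expert_outputs)


# Two experts that disagree completely on a two-dimensional output.
outputs = [[1.0, 0.0], [0.0, 1.0]]
mixed = moe_output([0.0, 0.0], outputs)          # equal gate weights
total = mode_loss(0.3, outputs, alpha=0.01)      # small alpha: mild pull
```

In this sketch, `alpha = 0` recovers the plain MoE objective, while a very large `alpha` lets the pairwise term dominate and pushes the experts toward identical outputs. That trade-off is one way to read the emphasis on "moderate" distillation.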

Paper Structure (27 sections, 10 equations, 4 figures, 10 tables)

Figures (4)

  • Figure 1: Narrow Vision. A and B represent two subsets of the training samples, while E1 and E2 represent experts 1 and 2. The normalized histogram schematically presents the ratio of gradients on each expert in learning the same subset. Narrow vision in MoE is significant, and MoDE alleviates it through distillation. As a result, MoDE improves the accuracy of individual experts on their dominating sample-based task domain (referred to as DS in the following section).
  • Figure 2: Overview of MoDE.
  • Figure 3: Task Domain Distribution of MoE, MoDE ($\alpha = 0.01$), MoDE ($\alpha = 100$).
  • Figure 4: MoDE w.r.t. the distillation strength $\alpha$.