Diversifying the Mixture-of-Experts Representation for Language Models with Orthogonal Optimizer

Boan Liu; Liang Ding; Li Shen; Keqin Peng; Yu Cao; Dazhao Cheng; Dacheng Tao

TL;DR

This work tackles degeneracy in Mixture-of-Experts (MoE) models, particularly homogeneous representations across experts, by introducing OMoE, an orthogonal optimizer that enforces diversity among experts. OMoE alternates between an accumulating phase, which uses a base optimizer, and an orthogonal phase, which updates each expert in a direction orthogonal to the subspaces of the other experts, using averaged projector matrices to guide the update. Experiments on GLUE, SuperGLUE, question answering, and named entity recognition show consistent improvements over standard AdamW-based MoE baselines, along with measurably increased expert diversity. The approach is a practical, architecture-agnostic way to enhance MoE expressivity at moderate overhead, and is presented as the first application of Orthogonal Weights Modification (OWM) to MoE optimization.

Abstract

The Mixture of Experts (MoE) has emerged as a highly successful technique in deep learning, based on the principle of divide-and-conquer to maximize model capacity without significant additional computational cost. Even in the era of large-scale language models (LLMs), MoE continues to play a crucial role, as some researchers have indicated that GPT-4 adopts the MoE structure to ensure diverse inference results. However, MoE is susceptible to performance degeneracy, particularly evident in the issues of imbalance and homogeneous representation among experts. While previous studies have extensively addressed the problem of imbalance, the challenge of homogeneous representation remains unresolved. In this study, we shed light on the homogeneous representation problem, wherein experts in the MoE fail to specialize and lack diversity, leading to frustratingly high similarities in their representations (up to 99% in a well-performing MoE model). This problem restricts the expressive power of the MoE and, we argue, contradicts its original intention. To tackle this issue, we propose a straightforward yet highly effective solution: OMoE, an orthogonal expert optimizer. Additionally, we introduce an alternating training strategy that encourages each expert to update in a direction orthogonal to the subspace spanned by other experts. Our algorithm facilitates MoE training in two key ways: first, it explicitly enhances representation diversity, and second, it implicitly fosters interaction between experts during the computation of orthogonal weights. Through extensive experiments, we demonstrate that our proposed optimization algorithm significantly improves the performance of fine-tuning the MoE model on the GLUE benchmark, the SuperGLUE benchmark, question-answering tasks, and named entity recognition tasks.
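The homogeneous representation problem described above can be made concrete with a small diagnostic. The following is a minimal sketch, not the authors' measurement: it assumes that similarity is the cosine similarity between flattened expert weight matrices (the abstract does not say which metric gives the 99% figure), and the function names and toy experts are hypothetical.

```python
import numpy as np

def pairwise_expert_similarity(expert_weights):
    """Return an (N, N) matrix of cosine similarities between flattened expert weights."""
    flat = np.stack([w.ravel() for w in expert_weights])        # (N, D)
    norms = np.linalg.norm(flat, axis=1, keepdims=True)
    unit = flat / np.clip(norms, 1e-12, None)                   # guard against zero norm
    return unit @ unit.T

def mean_off_diagonal(sim):
    """Average similarity over all pairs of distinct experts."""
    mask = ~np.eye(sim.shape[0], dtype=bool)
    return sim[mask].mean()

rng = np.random.default_rng(0)
shared = rng.normal(size=(16, 16))

# Hypothetical "collapsed" experts: one shared matrix plus small per-expert noise.
collapsed = [shared + 0.01 * rng.normal(size=shared.shape) for _ in range(4)]
# Hypothetical "diverse" experts: independently drawn matrices.
diverse = [rng.normal(size=shared.shape) for _ in range(4)]

sim_collapsed = pairwise_expert_similarity(collapsed)
sim_diverse = pairwise_expert_similarity(diverse)

print("collapsed experts:", mean_off_diagonal(sim_collapsed))
print("diverse experts:  ", mean_off_diagonal(sim_diverse))
```

Experts that barely differ from one another give a mean off-diagonal similarity close to 1, which is the regime the abstract describes; independently initialized experts give values near 0.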

Paper Structure (22 sections, 10 equations, 5 figures, 7 tables, 3 algorithms)

Background
Mixture of Experts
Orthogonal Weights Modification (OWM)
Orthogonal Optimizer for MoE
Degeneracy in MoE
OWM for Experts
Experiment
Compared Models
Fine-tune on GLUE
Fine-tune on SuperGLUE
Fine-tune on Question-Answering and Named Entity Recognition
Analysis
Ablation Study: Effects of Skipping Step
Ablation Study: Number of Experts
Ablation Study: Kind of Optimizer
...and 7 more sections

Figures (5)

Figure 1: The overview of OMoE optimizer. ① After being selected by the Gating Function, the input is sent to different experts. ② Experts calculate the corresponding orthogonal projector based on their input. ③ Based on the orthogonal projectors of the other experts (e.g. blue expert), the current expert to be updated (e.g. red expert) calculates the average projector. The average projector represents the orthogonal subspace of other experts. ④ Using the projector calculated in the previous step, the parameters are updated in the orthogonal direction of the other experts.
Figure 2: The full training process of OMoE. OMoE consists of two optimizers: the base optimizer (the blue Optimizer in the figure) and the OWM optimizer (the red OWM-Optimizer in the figure). Training alternates between two kinds of steps: the R Step (corresponding to the accumulating phase) and the O Step (corresponding to the orthogonal phase). In the R Step, $\Delta \mathbf{W}_l^{BP}(i)$ directly guides the update of parameters $\theta$ and $\phi$. In the O Step, the average orthogonal projector $\mathbf{\overline{P}}^{m}_l(i)$ is calculated from the average input $\overline{\mathbf{x}}_{l-1}(i)$ and then guides the gradient to be orthogonal. The two optimizers share states such as momentum and hyperparameters.
Figure 3: The figure shows to what extent OMoE expands the difference between experts. The first 3 columns are for experts in BERT while the other 3 columns are for experts in RoBERTa. The depth of color in the figure represents the percentage of parameters with a larger difference in OMoE compared to AdamW.
Figure 4: The normalized variance of parameters in experts with different skipping steps and the normalized GLUE scores with different skipping steps.
Figure 5: GLUE score improvement with different numbers of experts using the OMoE optimizer. In general, the benefit of orthogonal updates diminishes as the number of experts increases.
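
The steps in the captions of Figures 1 and 2 can be sketched in a few lines of NumPy. This is a toy illustration under stated assumptions, not the authors' implementation: the projector uses the regularized form P = I - A (A^T A + alpha I)^{-1} A^T (an assumed OWM-style formula, which this page does not give), each projector is built from a small batch of that expert's inputs rather than a running average input, plain SGD stands in for the base optimizer, no momentum state is shared, and the schedule of `skip` R Steps followed by one O Step is a guess at how the skipping step works. All function and parameter names are hypothetical.

```python
import numpy as np

def owm_projector(inputs, alpha=1e-3):
    """Projector onto the approximate orthogonal complement of the rows of `inputs`.

    inputs: (n_samples, d_in). Assumed form: P = I - A (A^T A + alpha I)^-1 A^T, A = inputs^T.
    """
    A = inputs.T                                         # (d_in, n_samples)
    gram = A.T @ A + alpha * np.eye(A.shape[1])          # regularized Gram matrix
    return np.eye(A.shape[0]) - A @ np.linalg.solve(gram, A.T)

def average_projector_of_others(projectors, i):
    """Average the projectors of every expert except expert i (needs >= 2 experts)."""
    others = [p for j, p in enumerate(projectors) if j != i]
    return sum(others) / len(others)

def is_orthogonal_step(step, skip):
    """Assumed schedule: `skip` R Steps, then one O Step, repeated."""
    return (step + 1) % (skip + 1) == 0

def omoe_update(weights, grads, expert_inputs, step, skip=1, lr=0.1):
    """Apply one R Step (plain SGD here) or one O Step (projected SGD) to all experts."""
    if not is_orthogonal_step(step, skip):
        return [w - lr * g for w, g in zip(weights, grads)]
    projectors = [owm_projector(x) for x in expert_inputs]
    new_weights = []
    for i, (w, g) in enumerate(zip(weights, grads)):
        p_bar = average_projector_of_others(projectors, i)
        new_weights.append(w - lr * (p_bar @ g))         # gradient projected before the update
    return new_weights

rng = np.random.default_rng(0)
d_in, d_out, n_experts = 8, 4, 2
weights = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
grads = [rng.normal(size=(d_in, d_out)) for _ in range(n_experts)]
expert_inputs = [rng.normal(size=(3, d_in)) for _ in range(n_experts)]

after_r = omoe_update(weights, grads, expert_inputs, step=0)   # R Step when skip=1
after_o = omoe_update(weights, grads, expert_inputs, step=1)   # O Step when skip=1

# With two experts, expert 0's O Step should barely change its output on expert 1's inputs.
delta_o = after_o[0] - weights[0]
leak = np.abs(expert_inputs[1] @ delta_o).max()
print("max output change on the other expert's inputs:", leak)
```

With two experts, the averaged projector is simply the other expert's projector, so the O Step update is nearly invisible on the other expert's inputs. With more experts, averaging yields a softer constraint rather than exact orthogonality to every other expert.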
