Unchosen Experts Can Contribute Too: Unleashing MoE Models' Power by Self-Contrast

Chufan Shi, Cheng Yang, Xinyu Zhu, Jiahao Wang, Taiqiang Wu, Siheng Li, Deng Cai, Yujiu Yang, Yu Meng

TL;DR

MoE models leave a large share of their parameters unused during inference, yet simply activating more experts can sometimes harm performance. The authors analyze how output distributions diverge across routing strategies and introduce Self-Contrast Mixture-of-Experts (SCMoE), a training-free decoding method that contrasts strong (top-2) and weak (rank-$k$) activations of the same model to adjust next-token logits. Across GSM8K, StrategyQA, MBPP, and HumanEval (with Mixtral 8x7B and DeepSeekMoE-16B), SCMoE yields consistent improvements with only modest latency overhead, and improves further when combined with self-consistency. The work offers a practical way to unlock MoE capacity by exploiting unchosen experts at inference time.

Abstract

Mixture-of-Experts (MoE) has emerged as a prominent architecture for scaling model size while maintaining computational efficiency. In MoE, each token in the input sequence activates a different subset of experts determined by a routing mechanism. However, the unchosen experts in MoE models do not contribute to the output, potentially leading to underutilization of the model's capacity. In this work, we first conduct exploratory studies to demonstrate that increasing the number of activated experts does not necessarily improve and can even degrade the output quality. Then, we show that output distributions from an MoE model using different routing strategies substantially differ, indicating that different experts do not always act synergistically. Motivated by these findings, we propose Self-Contrast Mixture-of-Experts (SCMoE), a training-free strategy that utilizes unchosen experts in a self-contrast manner during inference. In SCMoE, the next-token probabilities are determined by contrasting the outputs from strong and weak activation using the same MoE model. Our method is conceptually simple and computationally lightweight, as it incurs minimal latency compared to greedy decoding. Experiments on several benchmarks (GSM8K, StrategyQA, MBPP and HumanEval) demonstrate that SCMoE can consistently enhance Mixtral 8x7B's reasoning capability across various domains. For example, it improves the accuracy on GSM8K from 61.79 to 66.94. Moreover, combining SCMoE with self-consistency yields additional gains, increasing major@20 accuracy from 75.59 to 78.31.
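
The self-contrast step described in the abstract can be illustrated with a small sketch. The abstract only states that strong and weak activations of the same model are contrasted, so the exact formula here is an assumption: it uses a common contrastive-decoding form, `(1 + beta) * z_strong - beta * z_weak`, restricted to tokens the strong pass already finds plausible. The function names, `beta`, `alpha`, and the toy logits are all hypothetical and not taken from the paper.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def self_contrast_logits(z_strong, z_weak, beta=0.5, alpha=0.1):
    """Contrast strong-activation logits against weak-activation logits.

    ASSUMPTION: the adjustment formula and the plausibility mask are a
    generic contrastive-decoding recipe, not confirmed by this section.
    Tokens whose strong-pass probability falls below alpha * max
    probability are masked out with -inf.
    """
    p_strong = softmax(z_strong)
    cutoff = alpha * max(p_strong)
    adjusted = []
    for zs, zw, ps in zip(z_strong, z_weak, p_strong):
        if ps >= cutoff:
            adjusted.append((1 + beta) * zs - beta * zw)
        else:
            adjusted.append(-math.inf)
    return adjusted


def greedy_pick(logits):
    """Index of the highest-scoring token."""
    return max(range(len(logits)), key=lambda i: logits[i])


# Toy vocabulary of three tokens (hypothetical values).
z_strong = [2.0, 1.9, -3.0]  # e.g. logits from top-2 routing
z_weak = [2.0, 0.5, 4.0]     # e.g. logits from rank-k routing

adjusted = self_contrast_logits(z_strong, z_weak)
print(greedy_pick(z_strong), greedy_pick(adjusted))
```

In this toy case the strong pass alone narrowly prefers token 0, while the contrast favors token 1, which the weak pass rates much lower. Token 2 is masked because the strong pass considers it implausible, so a high weak-pass score cannot promote it.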

Paper Structure

This paper contains 41 sections, 6 equations, 5 figures, 11 tables.

Figures (5)

  • Figure 1: Performance comparison between increasing the value of top-$k$ (i.e., ensemble routing) and SCMoE. SCMoE surpasses the performance of ensemble routing across various benchmarks.
  • Figure 2: (a & b) Given an input $\mathbf{h}$, (a) and (b) demonstrate the workflows of top-2 routing and rank-$k$ routing (e.g., $k=2$). We use two MoE layers as a simple schematic, omitting other layers in MoE models. Note that, in the second MoE layer, rank-$k$ routing activates the unchosen expert in top-2 routing; (c) An illustrative example of how SCMoE works, which contrasts $z_{\text{top-2}}( x_{t} | x_{<t})$ with $z_{\text{rank-k}}( x_{t} | x_{<t} )$. The complete question and answer for this example are shown in Figure 3.
  • Figure 3: Heatmap of Kullback-Leibler Divergence (KLD) between the output distribution of the top-2 routing strategy ( $p_{\text{top-2}}( x_{t} | x_{<t})$ ) and different rank-$k$ routing strategies ( $p_{\text{rank-k}}( x_{t} | x_{<t} )$ ). The $k$ in rank-$k$ routing ranges from 1 to 8. The values in the heatmap are scaled by $10^{5}$. This example is taken from the GSM8K dataset. An additional quantitative study of the KLD is provided in Appendix A.
  • Figure 4: Experimental results of different weak activations. We set the strong activation with top-2 routing in SCMoE. The detailed results with their hyperparameters are reported in Appendix Table \ref{tab:a.weak}.
  • Figure 5: Experimental results on combining SCMoE with self-consistency on GSM8K using Mixtral 8x7B.
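
The routing strategies in Figure 2 and the divergence measured in Figure 3 can be sketched as follows. This is a minimal illustration under stated assumptions: rank-$k$ routing is taken to mean activating only the expert with the $k$-th highest router score, which the captions suggest but do not define precisely. The router scores, logits, and function names are hypothetical.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(z - m) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]


def ranked_experts(router_logits):
    """Expert indices sorted by router score, best first."""
    return sorted(range(len(router_logits)),
                  key=lambda i: router_logits[i], reverse=True)


def top_k_experts(router_logits, k=2):
    """Top-k routing: activate the k highest-scoring experts."""
    return ranked_experts(router_logits)[:k]


def rank_k_expert(router_logits, k):
    """ASSUMED rank-k routing: activate only the k-th ranked expert (1-indexed)."""
    return [ranked_experts(router_logits)[k - 1]]


def kl_divergence(p, q, eps=1e-12):
    """KL(p || q) for two discrete distributions over the same vocabulary."""
    return sum(pi * math.log((pi + eps) / (qi + eps))
               for pi, qi in zip(p, q) if pi > 0)


# Hypothetical router scores for one token in a layer with 8 experts.
router_logits = [0.1, 2.3, -0.5, 1.7, 0.0, -1.2, 0.9, 0.4]
strong = top_k_experts(router_logits, k=2)  # experts used by top-2 routing
weak = rank_k_expert(router_logits, k=3)    # an expert top-2 routing leaves unchosen

# Hypothetical next-token distributions produced by the two routing strategies.
p_top2 = softmax([2.0, 1.9, -3.0])
p_rank_k = softmax([2.0, 0.5, 4.0])
print(strong, weak, kl_divergence(p_top2, p_rank_k))
```

A larger divergence between the two distributions indicates a token position where the routing strategies disagree more strongly.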