Mixture of Neuron Experts

Runxi Cheng, Yuchen Guan, Yucheng Ding, Qingguo Hu, Yongxian Wei, Chun Yuan, Yelong Shen, Weizhu Chen, Yeyun Gong

TL;DR

This work reveals that parameters activated by MoE layers remain highly sparse at inference and that many neurons within each expert are effectively inactive. By decomposing experts into neuron-level sub-experts and applying a simple top-$K$ selection within each expert, the authors propose Mixture of Neuron Experts (MoNE), supplemented by a neuron-granular load-balance loss (NG-LBL) to encourage balanced usage. Empirical results show MoNE matches traditional MoE performance while using only about 50% of the MoE parameters, and it often outperforms MoE when activated parameters are held constant, all without additional routing parameters or inter-expert communication. The approach offers a practical, low-latency pathway to improve parameter utilization and inference efficiency in MoE-like models, especially as expert granularity scales.

Abstract

In this work, we first explore whether the parameters activated by the MoE layer remain highly sparse at inference. We perform a sparsification study on several representative MoE models. For each expert, we rank parameters by the magnitude of their activations from the gate projection and progressively prune the activated subset. Pruning up to 60% of parameters within that subset causes only negligible task-performance degradation; substantial drops occur only after more than 90% are removed. We further decompose experts into neuron-granular MoE and visualize their activation values, finding that most neuron activations are near zero. This observation motivates us to select only high-activation neuron experts during pretraining. Based on this insight, we propose Mixture of Neuron Experts (MoNE). MoNE achieves neuron-granular expert selection by only applying a simple top-k selection within each expert, incurs negligible latency, and requires no additional routing parameters or inter-expert communication. Extensive experiments demonstrate that MoNE matches traditional MoE performance while activating only 50% of the MoE-layer parameters, and it consistently outperforms traditional MoE when compared at equal numbers of activated parameters. These results suggest that MoNE is a practical approach to improving parameter utilization and inference efficiency in MoE-like models.
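
The selection step described above can be illustrated with a minimal NumPy sketch. It is not the authors' implementation. It assumes a gated FFN expert of the form `w_down @ (silu(w_gate @ x) * (w_up @ x))`, treats each hidden neuron as one "neuron expert", and keeps the `k` neurons whose gate activations have the largest magnitude. The SiLU activation, the function names (`dense_expert`, `topk_neuron_expert`), and the weight shapes are all assumptions made for illustration.

```python
import numpy as np

def silu(z):
    # SiLU activation; assumed here, the expert's actual activation may differ.
    return z / (1.0 + np.exp(-z))

def dense_expert(x, w_gate, w_up, w_down):
    # Full gated FFN expert: every hidden neuron contributes to the output.
    # Assumed shapes: w_gate, w_up are (d_ff, d_model); w_down is (d_model, d_ff).
    return w_down @ (silu(w_gate @ x) * (w_up @ x))

def topk_neuron_expert(x, w_gate, w_up, w_down, k):
    # One gate activation per hidden neuron (per "neuron expert").
    g = silu(w_gate @ x)
    # Indices of the k neurons with the largest |gate activation|.
    keep = np.argpartition(np.abs(g), -k)[-k:]
    # Only the selected rows of w_up and columns of w_down are used.
    return w_down[:, keep] @ (g[keep] * (w_up[keep] @ x))

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    d_model, d_ff = 8, 16
    x = rng.normal(size=d_model)
    w_gate = rng.normal(size=(d_ff, d_model))
    w_up = rng.normal(size=(d_ff, d_model))
    w_down = rng.normal(size=(d_model, d_ff))
    y_full = dense_expert(x, w_gate, w_up, w_down)
    y_half = topk_neuron_expert(x, w_gate, w_up, w_down, k=d_ff // 2)
    print("relative error at 50% of neurons:",
          np.linalg.norm(y_full - y_half) / np.linalg.norm(y_full))
```

With `k` equal to the hidden width, the sketch reproduces the dense expert exactly, which is the sense in which an expert can be read as a sum of per-neuron contributions. Setting `k` to half the hidden width mirrors the 50% setting mentioned in the abstract. The error printed for random weights says nothing about trained models.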

Paper Structure

This paper contains 23 sections, 12 equations, 11 figures, 7 tables, 1 algorithm.

Figures (11)

  • Figure 1: Performance of mainstream MoE models when only the neuron experts with higher activation weights are used, without extra training. Top-K Ratio refers to the ratio of selected neuron experts.
  • Figure 2: Activation values of the neuron experts, with the top 50% of values highlighted.
  • Figure 3: An expert in a traditional MoE can be decomposed into a weighted sum of neuron-granular FFNs, which can be realized as a neuron-granular MoE.
  • Figure 4: Comparison of the activation value $\mathbf{G}$ of the neuron experts between traditional MoE and MoNE. MoNE effectively increases the activation weights compared with traditional MoE.
  • Figure 5: Pre-training loss of traditional MoE versus MoNE.
  • ...and 6 more figures