ExpertWeaver: Unlocking the Inherent MoE in Dense LLMs with GLU Activation Patterns

Ziyu Zhao, Tong Zhu, Zhi Zhang, Tiantian Fan, Jinluan Yang, Kun Kuang, Zhongyu Wei, Fei Wu, Yu Cheng

TL;DR

This work argues that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion, and introduces ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations.

Abstract

Mixture-of-Experts (MoE) effectively scales model capacity while preserving computational efficiency through sparse expert activation. However, training high-quality MoEs from scratch is prohibitively expensive. A promising alternative is to convert pretrained dense models into sparse MoEs. Existing dense-to-MoE methods fall into two categories: dynamic structural pruning, which converts dense models into MoE architectures with moderate sparsity to balance performance and inference efficiency, and downcycling, which uses pretrained dense models to initialize highly sparse MoE architectures. However, existing methods break the intrinsic activation patterns within dense models, leading to suboptimal expert construction. In this work, we argue that the Gated Linear Unit (GLU) mechanism provides a natural blueprint for dense-to-MoE conversion. We show that the fine-grained, neuron-wise activation patterns of GLU reveal a coarse-grained structure, uncovering an inherent MoE architecture composed of consistently activated universal neurons and dynamically activated specialized neurons. Leveraging this discovery, we introduce ExpertWeaver, a training-free framework that partitions neurons according to their activation patterns and constructs shared experts and specialized routed experts with layer-adaptive configurations. Our experiments demonstrate that ExpertWeaver significantly outperforms existing methods, both as a training-free dynamic structural pruning technique and as a downcycling strategy for superior MoE initialization.
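
To illustrate why partitioning GLU neurons can yield experts at all, here is a minimal NumPy sketch (not the authors' code). It assumes a SwiGLU-style MLP with gate, up, and down projections; the names `glu_mlp`, `w_gate`, `w_up`, `w_down` and the toy sizes are hypothetical. It checks that splitting the intermediate neurons into disjoint groups and summing the per-group outputs reproduces the dense output exactly, so sparsity only enters once some groups are skipped.

```python
import numpy as np

def silu(z):
    # SiLU gate activation (assumed here; the abstract only says "GLU").
    return z / (1.0 + np.exp(-z))

def glu_mlp(x, w_gate, w_up, w_down, idx=None):
    """GLU MLP restricted to the intermediate neurons in `idx` (all if None)."""
    if idx is None:
        idx = np.arange(w_gate.shape[1])
    # Each intermediate neuron owns one column of w_gate / w_up
    # and the matching row of w_down.
    hidden = silu(x @ w_gate[:, idx]) * (x @ w_up[:, idx])
    return hidden @ w_down[idx, :]

rng = np.random.default_rng(0)
d_model, d_ff, n_experts = 8, 32, 4  # toy sizes, chosen for illustration
x = rng.normal(size=(5, d_model))
w_gate = rng.normal(size=(d_model, d_ff))
w_up = rng.normal(size=(d_model, d_ff))
w_down = rng.normal(size=(d_ff, d_model))

# Any disjoint partition of the neurons works; a random one is used here.
groups = np.array_split(rng.permutation(d_ff), n_experts)

dense_out = glu_mlp(x, w_gate, w_up, w_down)
sum_of_experts = sum(glu_mlp(x, w_gate, w_up, w_down, g) for g in groups)
```

Because the decomposition is exact, the quality of a converted MoE depends on which neurons are grouped together and which groups are activated per token, which is the question the activation-pattern analysis addresses.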

Paper Structure (62 sections, 15 equations, 8 figures, 10 tables)

Figures (8)

  • Figure 1: Neuron activation patterns across diverse tasks. We visualize the middle layer's activation patterns from Qwen2.5-7B on a subset of Flan-v2. a) Activation distribution of neurons across different tasks. b) Activation distribution of neurons within individual task clusters, where tasks belonging to the same cluster are enclosed in boxes.
  • Figure 2: Neuron Coefficient of Variation Across Layers.
  • Figure 3: The ExpertWeaver Framework. a) The GLU in the MLP layer contains three weight matrices, where the same color denotes corresponding neuron slices. b) Neuron activation patterns are captured using a multi-task calibration dataset. c) The CVs are computed to determine the budget for shared vs. routed experts. d) Neurons are clustered according to their activation patterns to form one shared expert and multiple routed experts.
  • Figure 4: Comparison of Downcycling, Upcycling, and From-Scratch Training. Comparison of training loss, evaluation loss, and downstream task performance using the same OLMoE model configuration under three different MoE initialization paradigms.
  • Figure 5: Training Loss Comparison with Different MoE Initialization Strategies.
  • ...and 3 more figures
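
The steps named in the Figure 3 caption can be sketched roughly as follows. This is an illustrative NumPy toy, not the paper's procedure: it assumes the coefficient of variation (standard deviation divided by mean) is taken per neuron across tasks, uses a fixed hypothetical `shared_fraction` in place of the CV-derived layer-adaptive budget, and replaces the clustering step with a simple stand-in that groups each remaining neuron by the task on which it is most active.

```python
import numpy as np

def coefficient_of_variation(act, eps=1e-8):
    """Per-neuron CV (std / mean) across tasks.

    act: array of shape (n_tasks, n_neurons) holding non-negative
    activation rates measured on a calibration set (assumed input format).
    """
    return act.std(axis=0) / (act.mean(axis=0) + eps)

def partition_neurons(act, shared_fraction=0.25):
    """Split neurons into one shared group and several routed groups."""
    cv = coefficient_of_variation(act)
    n_shared = int(round(shared_fraction * act.shape[1]))
    order = np.argsort(cv)  # ascending: most consistent neurons first
    shared = np.sort(order[:n_shared])
    rest = np.sort(order[n_shared:])
    # Stand-in for clustering: group by the task with the highest activation.
    dominant = act[:, rest].argmax(axis=0)
    routed = {int(t): rest[dominant == t] for t in np.unique(dominant)}
    return cv, shared, routed

# Toy calibration statistics: 3 tasks x 8 neurons (made-up numbers).
act = np.array([
    [0.9, 0.8, 0.70, 0.60, 0.05, 0.10, 0.00, 0.05],
    [0.9, 0.8, 0.05, 0.10, 0.70, 0.60, 0.05, 0.00],
    [0.9, 0.8, 0.00, 0.05, 0.10, 0.05, 0.80, 0.70],
])
cv, shared, routed = partition_neurons(act)
```

In this toy input, neurons 0 and 1 fire at the same rate on every task, so they have the lowest CV and land in the shared group, while the remaining neurons split into one routed group per task.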