CMoE: Converting Mixture-of-Experts from Dense to Accelerate LLM Inference

Zehua Pei, Lancheng Zou, Hui-Ling Zhen, Xianzhi Yu, Wulong Liu, Sinno Jialin Pan, Mingxuan Yuan, Bei Yu

TL;DR

CMoE tackles the high inference cost of large language models by converting dense FFNs into Mixture-of-Experts without training. It achieves this by profiling FFN activations to separate shared (always-active) and routed (sparsely-active) neurons, then forming routed experts through balanced clustering and deriving a differentiable router from activation statistics. The approach enables immediate deployment, with training-free configurations offering substantial speedups and near-dense perplexity at higher activation ratios, and lightweight LoRA fine-tuning restoring most downstream accuracy. The method scales to larger models and provides a practical deployment path for resource-constrained settings, with public code available for replication.

Abstract

Scaling large language models (LLMs) improves performance but dramatically increases inference costs. The feed-forward network (FFN), which consumes approximately 70% of inference compute, is a critical bottleneck, particularly at large batch sizes. While mixture-of-experts (MoE) architectures exploit activation sparsity for efficiency, converting existing dense models to MoEs has traditionally required resource-intensive continual pre-training. We present CMoE, a framework that rapidly transforms dense LLMs into MoEs without training. The key innovation lies in analyzing FFN neuron activations to partition them into shared (always-active) and routed experts. Routed neurons are clustered using a balanced assignment algorithm, and a differentiable router is constructed analytically from activation statistics, enabling immediate deployment or optional lightweight fine-tuning. Experiments show that, at an activation ratio of 75%, CMoE matches the dense model's perplexity while still delivering a 5% speedup. Further experiments show that a CMoE configuration activating just 25% of parameters reduces end-to-end latency by 1.5x while preserving usable perplexity without additional training. Moreover, a brief LoRA fine-tuning process (requiring only 1 hour and 2,000 samples) recovers over 76% of the dense model's downstream accuracy. By balancing performance and efficiency, CMoE offers a viable path for deploying LLMs in real-world scenarios where computational resources are limited. We make our code publicly available at https://github.com/JarvisPei/CMoE.
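
The partitioning idea described in the abstract can be illustrated with a toy sketch. This is not the authors' implementation: the function name `partition_neurons`, the boolean activation table, the Hamming distance between activation patterns, the farthest-point seeding, and the greedy capacity-limited assignment are all assumptions chosen for brevity. The abstract only states that a balanced assignment algorithm is used, and the analytically constructed router is not covered here.

```python
def partition_neurons(active, n_shared, n_experts):
    """Split FFN neurons into one shared group and equally sized routed groups.

    active[t][j] is truthy when neuron j counts as active on profiled token t
    (how "active" is decided is left to the caller; hypothetical input format).
    """
    n_tokens = len(active)
    n_neurons = len(active[0])

    # Activation rate: fraction of profiled tokens on which each neuron fires.
    rate = [sum(row[j] for row in active) / n_tokens for j in range(n_neurons)]

    # Assumption: the highest-rate neurons form the shared (always-on) group.
    order = sorted(range(n_neurons), key=lambda j: (-rate[j], j))
    shared = sorted(order[:n_shared])
    routed = order[n_shared:]
    if not routed or len(routed) % n_experts:
        raise ValueError("routed neurons must divide evenly among experts")
    capacity = len(routed) // n_experts

    def distance(a, b):
        # Hamming distance between the activation patterns of two neurons.
        return sum(row[a] != row[b] for row in active)

    # Farthest-point seeding: each new seed is far from all earlier seeds.
    seeds = [routed[0]]
    while len(seeds) < n_experts:
        candidates = (j for j in routed if j not in seeds)
        seeds.append(max(candidates,
                         key=lambda j: min(distance(j, s) for s in seeds)))

    # Greedy balanced assignment: take (neuron, expert) pairs from closest to
    # farthest, skipping neurons already placed and experts already full.
    pairs = sorted((distance(j, s), j, e)
                   for j in routed for e, s in enumerate(seeds))
    experts = [[] for _ in range(n_experts)]
    placed = set()
    for _, j, e in pairs:
        if j not in placed and len(experts[e]) < capacity:
            experts[e].append(j)
            placed.add(j)
    return shared, [sorted(group) for group in experts]


# Toy profile: 4 tokens x 6 neurons. Neurons 0-1 fire on every token,
# neurons 2-3 fire on the first two tokens, neurons 4-5 on the last two.
toy = [
    [1, 1, 1, 1, 0, 0],
    [1, 1, 1, 1, 0, 0],
    [1, 1, 0, 0, 1, 1],
    [1, 1, 0, 0, 1, 1],
]
shared, experts = partition_neurons(toy, n_shared=2, n_experts=2)
print(shared, experts)
```

On this toy profile the always-firing neurons end up in the shared group and the co-firing neurons are grouped into two routed experts of equal size. The equal-capacity constraint is what makes the grouping "balanced"; a real conversion would apply the same idea to activations profiled from calibration data.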

Paper Structure

This paper contains 14 sections, 27 equations, 4 figures, 7 tables.

Figures (4)

  • Figure 1: The overview of our proposed CMoE.
  • Figure 2: Trade-off between Model Performance (PPL) and Construction Time with Increasing Fine-tuning Data (WikiText-2 samples, $25\%$ activation).
  • Figure 3: Effect of Load Balancing on expert utilization in Llama-2 7B final block ($25\%$ activation).
  • Figure A.1: The proposed framework, CMoE, and the numerical analysis on its effectiveness.