
LIBMoE: A Library for comprehensive benchmarking Mixture of Experts in Large Language Models

Nam V. Nguyen, Thong T. Doan, Luong Tran, Van Nguyen, Quang Pham

TL;DR

LibMoE addresses the bottleneck of reproducible, large-scale MoE research under limited compute by providing a modular framework that integrates seven SMoE algorithms, standardized training pipelines for both language and vision–language models, and zero-shot evaluation across diverse benchmarks. The study reveals that no single MoE variant universally dominates; routing dynamics, initialization, and training regime (pretraining vs sparse upcycling) significantly influence stability, specialization, and efficiency. Through comprehensive diagnostics—routing entropy, router margins, expert co-activation, and load-balancing effects—LibMoE offers actionable insights for designing stable, scalable MoE systems and establishes a practical benchmark for future MoE research. The work thus lowers entry barriers, enables fair comparisons, and accelerates development of interpretable, efficient, and deployable MoE models in real-world settings.
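To make the diagnostics named above concrete, here is a minimal sketch of how routing entropy, router margin, and per-expert load could be computed from raw router logits. It is an illustration only, not LibMoE's actual API: the function names (`routing_entropy`, `router_margin`, `expert_load`) and the list-of-lists input format are assumptions. The definitions used here (Shannon entropy of the softmax distribution, top-1 minus top-2 probability, fraction of top-k assignments per expert) are common choices and may differ from the paper's exact formulations.

```python
import math

def softmax(logits):
    """Numerically stable softmax over one token's router logits."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]

def routing_entropy(probs):
    """Shannon entropy (in nats) of a routing distribution.
    Low entropy means the router is confident; log(num_experts) is the maximum."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)

def router_margin(probs):
    """Gap between the highest and second-highest routing probabilities
    (an assumed definition of 'margin' for this sketch)."""
    top = sorted(probs, reverse=True)
    return top[0] - top[1]

def expert_load(batch_logits, k):
    """Fraction of top-k assignments received by each expert over a batch of tokens."""
    num_experts = len(batch_logits[0])
    counts = [0] * num_experts
    for logits in batch_logits:
        ranked = sorted(range(num_experts), key=lambda i: logits[i], reverse=True)
        for i in ranked[:k]:
            counts[i] += 1
    total = len(batch_logits) * k
    return [c / total for c in counts]

if __name__ == "__main__":
    # Hypothetical router logits for 3 tokens and 4 experts.
    batch = [
        [2.0, 1.0, 0.0, -1.0],
        [0.5, 3.0, 0.2, 0.1],
        [0.0, 0.1, 5.0, 4.0],
    ]
    for logits in batch:
        p = softmax(logits)
        print(f"entropy={routing_entropy(p):.3f} margin={router_margin(p):.3f}")
    print("load:", expert_load(batch, k=2))
```

A perfectly balanced router would yield a load of `1 / num_experts` for every expert; large deviations from that value are the kind of imbalance that load-balancing losses are meant to counteract.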

Abstract

Mixture of experts (MoE) architectures have become a cornerstone for scaling up and are a key component in most large language models such as GPT-OSS, DeepSeek-V3, Llama-4, and Gemini-2.5. However, systematic research on MoE remains severely constrained by the prohibitive computational costs of training and evaluation, putting large-scale studies out of reach for most researchers. We introduce LibMoE, a unified framework for reproducible, efficient, and extensible MoE research that supports both pretraining and sparse-upcycling regimes. Beyond unified implementations, the framework provides transparent analytical tools for probing routing and expert dynamics. Leveraging this foundation, we conduct a comprehensive analysis along three dimensions: (i) routing dynamics, covering expert selection patterns, routing stability and optimality, and how routing entropy reveals task specialization and expert diversity; (ii) the effect of lightweight initialization on load balancing, demonstrating how subtle changes in router initialization shape early expert utilization; and (iii) training regime differences, revealing how sparse upcycling and full pretraining exhibit distinct routing patterns and stability profiles. By lowering the barrier to entry and standardizing evaluation, along with our comprehensive analysis, LibMoE broadens access to MoE research and establishes a reliable benchmark to guide future innovations. Project page: https://fsoft-aic.github.io/fsoft-LibMoE.github.io.

Paper Structure

This paper contains 48 sections, 6 equations, 25 figures, 9 tables.

Figures (25)

  • Figure 1: Overview of the LibMoE-VLM architecture and training process. In the first stage of Dense Training, only the MLP is trained to improve alignment. In the second stage, all parameters are trained. During SMoE Training, the feed-forward networks (FFNs) of the Vision Encoder (VE) and MLP Connector are used to initialize the experts within the SMoE framework, and all parameters continue to be trained.
  • Figure 2: Benchmark curves during training, comparing SMoE and dense models in LLM pretraining.
  • Figure 3: Impact of Training Data Percentage on Expert Selection.
  • Figure 4: Effect of upcycled shared experts trained on prior tasks on routing behavior, measured by expert change rate during language model pretraining.
  • Figure 5: Performance of SMoE variants when replacing the top-1 expert with the top-$(K+1)$ expert in vision-language modeling and language modeling tasks.
  • ...and 20 more figures