FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

Hao Kang, Zichun Yu, Chenyan Xiong

TL;DR

FLAME-MoE addresses the lack of transparent, end-to-end MoE research platforms by releasing a family of seven decoder-only MoE models (38M to 1.7B active parameters), each using 64 experts per layer with top-8 gating and 2 shared experts, together with full training artifacts: data pipelines, scripts, logs, and checkpoints. It uses a scaling-law framework (IsoFLOP profiles and a fitted parametric loss function) to derive compute-optimal configurations, and it improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs across six downstream tasks. Analyses of expert specialization, co-activation, and routing saturation show that experts increasingly specialize on distinct token subsets, co-activation matrices remain sparse, and routing behavior stabilizes early in training. By releasing all of this openly, FLAME-MoE aims to support reproducible, cross-scale research on sparse language models.

Abstract

Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture--64 experts with top-8 gating and 2 shared experts--closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
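To make the "64 experts with top-8 gating and 2 shared experts" description concrete, the following is a minimal single-token sketch of that style of routing. It is not taken from the FLAME-MoE codebase. The function names (`route_token`, `moe_layer`, `make_toy_expert`), the softmax router, the renormalization of the top-k gate weights, and the choice to treat the shared experts as additional to the 64 routed ones are all assumptions made for illustration; the released code may differ on each point.

```python
import math


def softmax(xs):
    """Numerically stable softmax over a list of floats."""
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]


def route_token(router_logits, top_k=8):
    """Pick the top_k experts for one token.

    Returns (expert_ids, gate_weights), ordered by decreasing router
    probability. Gate weights are renormalized to sum to 1 over the
    selected experts (an assumption of this sketch).
    """
    probs = softmax(router_logits)
    order = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)
    ids = order[:top_k]
    norm = sum(probs[i] for i in ids)
    return ids, [probs[i] / norm for i in ids]


def moe_layer(x, router_logits, routed_experts, shared_experts, top_k=8):
    """Combine always-on shared experts with top_k routed experts.

    x is one token's hidden vector (list of floats); each expert is a
    callable mapping such a vector to a vector of the same length.
    """
    ids, gates = route_token(router_logits, top_k)
    out = [0.0] * len(x)
    # Shared experts process every token, with no gating.
    for expert in shared_experts:
        out = [o + v for o, v in zip(out, expert(x))]
    # Routed experts run only if selected, weighted by their gate.
    for i, g in zip(ids, gates):
        out = [o + g * v for o, v in zip(out, routed_experts[i](x))]
    return out, ids


def make_toy_expert(scale):
    """Hypothetical stand-in for an expert FFN: scales its input."""
    return lambda x: [scale * v for v in x]


if __name__ == "__main__":
    logits = [float(i) for i in range(64)]           # toy router scores
    routed = [make_toy_expert(float(i)) for i in range(64)]
    shared = [make_toy_expert(1.0), make_toy_expert(1.0)]
    out, used = moe_layer([1.0, 2.0], logits, routed, shared)
    print("experts used:", used)
    print("output:", out)
```

Only the selected routed experts are evaluated for a given token, which is the sense in which an MoE model activates "only a fraction of the model per token"; the shared experts are the part that every token always passes through.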

Paper Structure

This paper contains 22 sections, 9 equations, 13 figures, 5 tables.

Figures (13)

  • Figure 1: Scaling law experiments: (a) IsoFLOP profiles; (b) parametric loss function fitting; (c) fitted scaling law; (d) generalization from validation loss to downstream performance.
  • Figure 2: Downstream comparison between FLAME-MoE and dense models during pretraining.
  • Figure 3: Training efficiency of FLAME-MoE-1.7B-10.3B under different parallelization strategies (EP = Expert Parallel, PP = Pipeline Parallel). Dense-1.4B is also included here as a comparison.
  • Figure 4: Evolution of specialization scores for the top-2 most specialized tokens across Experts 0, 1, 6, 9 at the final layer in FLAME-MoE-1.7B-10.3B on the validation set.
  • Figure 5: Expert co-activation in FLAME-MoE-1.7B-10.3B at the final checkpoint on the validation set. The heatmap shows pairwise co-activation scores among the 16 experts with the highest co-activation across layers 2, 6, 12, and 18. Expert IDs are shown on the axes.
  • ...and 8 more figures