FLAME-MoE: A Transparent End-to-End Research Platform for Mixture-of-Experts Language Models

Hao Kang; Zichun Yu; Chenyan Xiong

TL;DR

FLAME-MoE addresses the lack of transparent, end-to-end MoE research platforms by releasing a family of seven decoder-only MoE models (38M–1.7B active parameters) with 64 experts per layer, together with the full training artifacts. It uses a scaling-law framework (IsoFLOP analysis and a parametric loss fit) to derive compute-optimal configurations, and it improves on dense baselines trained with identical FLOPs across six downstream tasks. Analyses of expert specialization, co-activation, and routing saturation show that experts increasingly specialize on distinct token subsets, co-activation remains sparse, and routing stabilizes early in training. By releasing all code, data pipelines, logs, and checkpoints, FLAME-MoE aims to support reproducible, cross-scale research on sparse language models.
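To make the co-activation analysis concrete, here is a minimal sketch of one plausible way to measure it: for each pair of experts, count how often both are selected for the same token, then normalize by how often the first expert is selected at all. This definition, the function name `coactivation_matrix`, and the input layout are assumptions for illustration and may differ from what the paper computes.

```python
def coactivation_matrix(token_expert_ids, num_experts):
    """Row-normalized co-activation frequencies.

    token_expert_ids: one list of selected expert ids per token
    (hypothetical input layout, e.g. collected from router outputs).
    Entry [i][j] estimates P(expert j selected | expert i selected).
    This is an assumed definition, not necessarily the paper's.
    """
    counts = [[0] * num_experts for _ in range(num_experts)]
    for ids in token_expert_ids:
        for i in ids:
            for j in ids:
                counts[i][j] += 1  # diagonal counts[i][i] = activations of expert i

    matrix = []
    for i in range(num_experts):
        denom = counts[i][i]
        # Experts that were never selected get an all-zero row.
        matrix.append([counts[i][j] / denom if denom else 0.0
                       for j in range(num_experts)])
    return matrix


# Toy example: three tokens, each routed to two of four experts.
toy = coactivation_matrix([[0, 1], [0, 2], [0, 1]], num_experts=4)
```

Under this definition, a sparse matrix means that most off-diagonal entries are close to zero, that is, most expert pairs are rarely selected together.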

Abstract

Recent large language models such as Gemini-1.5, DeepSeek-V3, and Llama-4 increasingly adopt Mixture-of-Experts (MoE) architectures, which offer strong efficiency-performance trade-offs by activating only a fraction of the model per token. Yet academic researchers still lack a fully open, end-to-end MoE platform for investigating scaling, routing, and expert behavior. We release FLAME-MoE, a completely open-source research suite composed of seven decoder-only models, ranging from 38M to 1.7B active parameters, whose architecture (64 experts with top-8 gating and 2 shared experts) closely reflects modern production LLMs. All training data pipelines, scripts, logs, and checkpoints are publicly available to enable reproducible experimentation. Across six evaluation tasks, FLAME-MoE improves average accuracy by up to 3.4 points over dense baselines trained with identical FLOPs. Leveraging full training trace transparency, we present initial analyses showing that (i) experts increasingly specialize on distinct token subsets, (ii) co-activation matrices remain sparse, reflecting diverse expert usage, and (iii) routing behavior stabilizes early in training. All code, training logs, and model checkpoints are available at https://github.com/cmu-flame/FLAME-MoE.
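The top-8 gating mentioned in the abstract can be illustrated with a small sketch of per-token routing. The version below assumes a softmax over all router logits followed by renormalization of the selected top-k probabilities, which is one common choice and not confirmed by the abstract. The name `route_token` is hypothetical, and shared experts are not modeled because they bypass the router and are applied to every token.

```python
import math
import random

NUM_EXPERTS = 64  # experts per layer, as stated in the abstract
TOP_K = 8         # experts selected per token, as stated in the abstract


def route_token(router_logits, top_k=TOP_K):
    """Pick the top-k experts for one token and return their gate weights.

    Assumption: softmax over all logits, then renormalize the top-k
    probabilities so the returned weights sum to 1.
    """
    peak = max(router_logits)
    exps = [math.exp(x - peak) for x in router_logits]  # numerically stable softmax
    total = sum(exps)
    probs = [e / total for e in exps]

    # Indices of the k largest probabilities, highest first.
    chosen = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    weights = [probs[i] / norm for i in chosen]
    return chosen, weights


# Usage with random logits standing in for a learned router's output.
random.seed(0)
logits = [random.gauss(0.0, 1.0) for _ in range(NUM_EXPERTS)]
expert_ids, gate_weights = route_token(logits)
```

Only the selected experts run for a given token, and the layer output is the gate-weighted sum of their outputs plus the contribution of any shared experts. This is why the active parameter count is much smaller than the total.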
