LLaDA-MoE: A Sparse MoE Diffusion Language Model

Fengqi Zhu, Zebin You, Yipeng Xing, Zenan Huang, Lin Liu, Yihong Zhuang, Guoshan Lu, Kangyu Wang, Xudong Wang, Lanning Wei, Hongrui Guo, Jiaqi Hu, Wentao Ye, Tieyuan Chen, Chenchen Li, Chengfu Tang, Haibo Feng, Jun Hu, Jun Zhou, Xiaolu Zhang, Zhenzhong Lan, Junbo Zhao, Da Zheng, Chongxuan Li, Jianguo Li, Ji-Rong Wen

TL;DR

The paper tackles the high computational cost of diffusion language models by introducing LLaDA-MoE, a sparse MoE diffusion model trained from scratch on ~20T tokens that activates only 1.4B parameters during inference. It demonstrates that a large, sparse MoE backbone can surpass previous dense diffusion models and, after instruction tuning, approaches the performance of Qwen2.5-3B-Instruct across diverse tasks. The authors provide a multi-stage training pipeline, include a variable-length training technique to reduce train/test mismatch, and employ top-k MoE routing with load-balancing mechanisms to maintain efficiency. The work validates sparse MoE as an effective approach for efficient diffusion modeling and suggests substantial potential for scaling and future improvements.

Abstract

We introduce LLaDA-MoE, a large language diffusion model with the Mixture-of-Experts (MoE) architecture, trained from scratch on approximately 20T tokens. LLaDA-MoE achieves competitive performance with significantly reduced computational overhead by maintaining a 7B-parameter capacity while activating only 1.4B parameters during inference. Our empirical evaluation shows that LLaDA-MoE achieves state-of-the-art performance even against diffusion language models with more parameters, surpassing the previous diffusion language models LLaDA, LLaDA 1.5, and Dream across multiple benchmarks. The instruct-tuned model LLaDA-MoE-7B-A1B-Instruct demonstrates capabilities comparable to Qwen2.5-3B-Instruct in knowledge understanding, code generation, mathematical reasoning, agent, and alignment tasks, despite using fewer active parameters. Our results show that a sparse MoE architecture trained with the masked diffusion language modeling objective still brings out MoE's strengths, namely efficient inference with few active parameters, and leaves ample room for further exploration of diffusion language models. LLaDA-MoE models are available on Hugging Face.

Paper Structure

This paper contains 9 sections, 6 equations, 4 figures, 3 tables.

Figures (4)

  • Figure 1: Benchmark results. We compare LLaDA-MoE with larger MDMs and Qwen2.5-3B-Instruct across key tasks in knowledge, reasoning, mathematics, coding, and agent tasks. Despite using fewer activated parameters, LLaDA-MoE consistently outperforms other diffusion language models and achieves performance comparable to Qwen2.5-3B-Instruct.
  • Figure 2: Overview of the generation process and architecture. Left: The iterative generation process from fully masked ($t=1$) to fully unmasked ($t=0$). Blue blocks are fixed user prompt tokens, green blocks are mask tokens. The model iteratively predicts and remasks tokens until generation completes. Right: The MoE architecture with router selecting top-$2$ experts per token. The histogram shows expert routing distribution, and outputs are weighted combinations of selected experts, enabling efficient sparse activation.
  • Figure 3: Training pipeline. LLaDA-MoE is trained through pretraining stage 1 (10T tokens), pretraining stage 2 (10T tokens), annealing stage 1 (500B tokens), and annealing stage 2 (500B tokens with 8k context length), followed by SFT on curated prompt–answer pairs.
  • Figure 4: Training dynamics of auxiliary losses over training tokens. LLaDA‑MoE pre‑training results over the first 1T tokens. Left: Z‑Loss; right: Load‑Balancing Loss.
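
The top-2 routing in Figure 2 and the two auxiliary losses in Figure 4 can be illustrated with a small NumPy sketch. This is not the authors' implementation. The captions do not give the formulas, so the sketch assumes a softmax router whose top-k probabilities are renormalized into gate weights, a Switch-Transformer-style load-balancing loss, and an ST-MoE-style router z-loss. All function names, shapes, and the toy tanh experts are hypothetical.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)  # subtract the max for numerical stability
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def top_k_route(router_logits, k=2):
    """Pick the k most probable experts per token and renormalize their weights."""
    probs = softmax(router_logits)                        # (tokens, experts)
    top_idx = np.argsort(-probs, axis=-1)[:, :k]          # (tokens, k)
    top_p = np.take_along_axis(probs, top_idx, axis=-1)   # (tokens, k)
    gates = top_p / top_p.sum(axis=-1, keepdims=True)     # assumption: renormalize over the selected experts
    return top_idx, gates, probs

def moe_forward(x, router_w, expert_ws, k=2):
    """Sparse MoE layer: each token is processed only by its k selected experts."""
    logits = x @ router_w                                 # (tokens, experts)
    top_idx, gates, probs = top_k_route(logits, k)
    out = np.zeros_like(x)
    for slot in range(k):
        for e, w in enumerate(expert_ws):
            mask = top_idx[:, slot] == e                  # tokens whose slot-th choice is expert e
            if mask.any():
                # Toy expert (assumption): a single linear map followed by tanh.
                expert_out = np.tanh(x[mask] @ w)
                out[mask] += gates[mask, slot][:, None] * expert_out
    return out, logits, top_idx, gates, probs

def load_balancing_loss(probs, top_idx, num_experts):
    """Assumed Switch-style form: num_experts * sum_e f_e * p_e."""
    counts = np.bincount(top_idx.ravel(), minlength=num_experts)
    f = counts / top_idx.size                             # fraction of assignments sent to each expert
    p = probs.mean(axis=0)                                # mean router probability for each expert
    return num_experts * float(np.sum(f * p))

def router_z_loss(logits):
    """Assumed ST-MoE-style form: mean squared log-sum-exp of the router logits."""
    m = logits.max(axis=-1, keepdims=True)
    lse = m.squeeze(-1) + np.log(np.exp(logits - m).sum(axis=-1))
    return float(np.mean(lse ** 2))

# Toy usage: 16 tokens, hidden size 8, 4 experts, top-2 routing.
rng = np.random.default_rng(0)
x = rng.normal(size=(16, 8))
router_w = rng.normal(size=(8, 4))
expert_ws = [rng.normal(size=(8, 8)) for _ in range(4)]
out, logits, top_idx, gates, probs = moe_forward(x, router_w, expert_ws, k=2)
print(out.shape, load_balancing_loss(probs, top_idx, 4), router_z_loss(logits))
```

Under these assumed definitions, the load-balancing term equals 1 when the mean router probability is uniform across experts and grows as both the assignments and the router probabilities concentrate on the same few experts. The z-loss penalizes large router logits.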