Mixture of Experts in Large Language Models

Danyang Zhang, Junhao Song, Ziqian Bi, Xinyuan Song, Yingfang Yuan, Tianyang Wang, Joe Yeong, Junfeng Hao

TL;DR

This survey tackles the challenge of scaling large language models without prohibitive compute by detailing Mixture-of-Experts (MoE) architectures that activate sparse subsets of experts per input. It covers foundational and advanced router designs, meta-learning and knowledge transfer mechanisms, and a broad portfolio of domain-specific MoE applications, from NLP and multimodal tasks to healthcare and vision. Key contributions include a unified taxonomy of MoE designs, bridge concepts between routing stability and deployment practicality, and the introduction of evaluation frameworks that account for accuracy, performance, and cost. The work highlights ongoing challenges such as expert diversity, routing robustness, and theoretical underpinnings, while outlining practical directions for building scalable, reliable, and adaptable MoE-based systems.

Abstract

This paper presents a comprehensive review of the Mixture-of-Experts (MoE) architecture in large language models, highlighting its ability to significantly enhance model performance while maintaining minimal computational overhead. Through a systematic analysis spanning theoretical foundations, core architectural designs, and large language model (LLM) applications, we examine expert gating and routing mechanisms, hierarchical and sparse MoE configurations, meta-learning approaches, multimodal and multitask learning scenarios, real-world deployment cases, and recent advances and challenges in deep learning. Our analysis identifies key advantages of MoE, including superior model capacity compared to equivalent Bayesian approaches, improved task-specific performance, and the ability to scale model capacity efficiently. We also underscore the importance of ensuring expert diversity, accurate calibration, and reliable inference aggregation, as these are essential for maximizing the effectiveness of MoE architectures. Finally, this review outlines current research limitations, open challenges, and promising future directions, providing a foundation for continued innovation in MoE architecture and its applications.

Paper Structure

This paper contains 19 sections, 19 equations, 9 figures, 1 table.

Figures (9)

  • Figure 1: Timeline of Mixture-of-Experts (MoE) model development. The timeline shows key milestones in MoE architecture evolution from foundational concepts to modern large-scale implementations.
  • Figure 2: A comprehensive taxonomy of Mixture of Experts (MoE) models, organizing methodologies into seven key categories: language models, multimodal models, architectural innovations, training strategies, routing mechanisms, application scenarios, and challenges. Each category encompasses specific techniques, implementations, and representative models.
  • Figure 3: A brief illustration of a sparsely gated Mixture of Experts (MoE) architecture on a decoder-only transformer. In this figure, the top-$k$ routing mechanism is configured with $k=2$, meaning the gating function selects the two highest-scoring FFN experts for each token based on the router's softmax probabilities. The selected experts are evaluated in parallel, with their outputs aggregated using weighted summation.
  • Figure 4: Comparison of routing strategies in Token Choice and Expert Choice Mixture-of-Experts architectures. (A) Token Choice routing: Each token selects its most suitable experts based on computed affinity scores, with the token "We" routed to Expert 1 and Expert 3 and the token "Like" routed to Expert 3 and Expert 4, each with its corresponding probability weight. (B) Expert Choice routing: Experts maintain fixed computational budgets and select their preferred tokens from the input sequence, where Expert 1 processes tokens ["We", "Love", "To", "Study"] and Expert 2 handles ["We", "Love", "Quite", "Library"], enabling balanced workload distribution across experts while allowing tokens to be processed by multiple experts when beneficial.
  • Figure 5: Architectural comparison between standard MoE and MixER layers. (A) Standard MoE architecture where input $x$ flows through the gating network to generate routing decisions, directing computation to selected experts via the router mechanism. (B) Enhanced MixER layer design that incorporates an additional context vector $\xi$ alongside input $x$ for routing decisions. The gating network leverages both inputs to compute expert selection probabilities, while the MixER approach eliminates the traditional softmax-weighted output combination used in conventional MoE implementations.
  • ...and 4 more figures
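
The two routing schemes described in the captions for Figures 3 and 4 can be sketched in a few lines of plain Python. This is a minimal illustration under stated assumptions, not the implementation of any model covered by the survey. The function names (`top_k_route`, `moe_output`, `expert_choice_route`), the toy experts, and the choice to renormalize the top-$k$ weights so they sum to 1 are all assumptions introduced here; the captions only state that the selected experts' outputs are combined by weighted summation.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of router logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def top_k_route(router_logits, k=2):
    """Token Choice: one token picks its k highest-probability experts.

    Returns a list of (expert_index, weight) pairs. Renormalizing the
    weights over the selected experts is an assumption of this sketch.
    """
    probs = softmax(router_logits)
    top = sorted(range(len(probs)), key=lambda i: probs[i], reverse=True)[:k]
    norm = sum(probs[i] for i in top)
    return [(i, probs[i] / norm) for i in top]


def moe_output(x, router_logits, experts, k=2):
    """Weighted sum of the outputs of the k experts chosen for token x."""
    out = [0.0] * len(x)
    for idx, weight in top_k_route(router_logits, k):
        y = experts[idx](x)  # only the selected experts are evaluated
        out = [o + weight * yi for o, yi in zip(out, y)]
    return out


def expert_choice_route(logits_per_token, capacity):
    """Expert Choice: each expert picks its `capacity` highest-affinity tokens.

    `logits_per_token[t][e]` is the router logit of token t for expert e.
    Returns {expert_index: [token_index, ...]}. A token may be picked by
    several experts or by none.
    """
    probs = [softmax(row) for row in logits_per_token]
    n_tokens, n_experts = len(probs), len(probs[0])
    return {
        e: sorted(range(n_tokens), key=lambda t: probs[t][e], reverse=True)[:capacity]
        for e in range(n_experts)
    }


# Hypothetical toy experts: expert i scales its input by (i + 1).
toy_experts = [lambda x, s=s: [s * v for v in x] for s in (1.0, 2.0, 3.0, 4.0)]
routes = top_k_route([2.0, 0.5, 1.0, -1.0], k=2)
mixed = moe_output([1.0, 1.0], [2.0, 0.5, 1.0, -1.0], toy_experts, k=2)
picks = expert_choice_route([[2.0, 0.0], [1.0, 1.0], [0.0, 2.0]], capacity=2)
```

In the Token Choice sketch the per-expert load depends entirely on the router's scores, so some experts can receive many more tokens than others. In the Expert Choice sketch every expert processes exactly `capacity` tokens, which is the fixed computational budget the Figure 4 caption refers to.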