Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

Shuqing Luo, Ye Han, Pingzhi Li, Jiayin Qin, Jie Peng, Yang Zhao, Yu Cao, Tianlong Chen

TL;DR

Mozart is an algorithm-hardware co-design framework for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. It exploits the inherent modularity of chiplets and introduces an expert allocation strategy that enables efficient on-package all-to-all communication.

Abstract

The Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) through modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including poor memory locality, communication overhead, and inefficient use of computing resources. Inspired by the modular organization of the human brain, we propose Mozart, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, Mozart exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, Mozart adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and hierarchical memory structure. Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.
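To make the idea of communication-computation overlap through streaming concrete, the following is a minimal timing sketch, not the paper's scheduler. It assumes a hypothetical two-stage pipeline in which every token chunk needs one transfer step and one compute step of fixed duration; the function names and the numbers (arbitrary time units) are illustrative assumptions only.

```python
def serial_time(n_chunks: int, compute: float, comm: float) -> float:
    """Total time when each chunk is transferred, then computed, with no overlap."""
    return n_chunks * (compute + comm)


def pipelined_time(n_chunks: int, compute: float, comm: float) -> float:
    """Total time when the transfer of chunk i+1 overlaps the compute of chunk i.

    Assumes a uniform two-stage pipeline: the first transfer and the last
    compute cannot be hidden, and the slower stage sets the steady-state pace.
    """
    return comm + (n_chunks - 1) * max(compute, comm) + compute


# Hypothetical numbers: 4 chunks, 3 units of compute and 2 units of transfer each.
print(serial_time(4, 3, 2))     # 20
print(pipelined_time(4, 3, 2))  # 14
```

In this toy model, splitting work into more chunks hides more of the shorter stage behind the longer one, which is the intuition behind streaming tokens and experts rather than moving them in one bulk transfer.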

Paper Structure (43 sections, 7 equations, 16 figures, 4 tables, 1 algorithm)

This paper contains 43 sections, 7 equations, 16 figures, 4 tables, 1 algorithm.

Figures (16)

  • Figure 1: Parameter distribution in modern MoE-LLMs across various scales. The routed experts module constitutes over $90\%$ of the total parameters in these architectures.
  • Figure 2: Algorithm-Hardware Co-Design Diagram of Mozart. $\mathtt{Mozart}$ takes an algorithm-hardware co-design approach: the algorithm-level expert clustering and allocation schemes are shown on the left, and the architecture-level 3.5D chiplet system on the right. The MoE-LLM parameters are modularized in each decoder layer and mapped to the individual chiplets.
  • Figure 3: Left: Activation frequency for pre-trained DeepSeek-MoE, indicating expert specialization. Right: Co-activation pattern for pre-trained DeepSeek-MoE, indicating expert collaboration.
  • Figure 4: Fine-grained scheduling pipeline in the forward pass. The streaming tokens, marked with their execution order, effectively overlap computation (purple blocks) with DRAM communication (pink blocks, saving activations). The training pipeline uses three types of chiplets: the attention chiplet, the highly-activated chiplet, and the less-activated chiplet. Since the two MoE chiplets share the same DRAM I/O, the highly activated experts should be loaded to the chiplet first for better communication-computation overlap.
  • Figure 5: The overall 3.5D chiplet architecture in $\mathtt{Mozart}$. The hardware architecture implements a three-layer hierarchical tree topology, comprising a central attention node, switch nodes, and peripheral MoE nodes. The two-tier dies are connected face-to-face.
  • ...and 11 more figures
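
The activation-frequency and co-activation statistics mentioned in the Figure 3 caption can be gathered from router outputs. The sketch below is one plausible way to count them, assuming each token's routing decision is available as a tuple of selected expert ids; the function name `routing_statistics` and the sample routes are hypothetical and are not taken from the paper.

```python
from collections import Counter
from itertools import combinations


def routing_statistics(topk_indices, num_experts):
    """Return (activation frequency per expert, pairwise co-activation counts).

    topk_indices: one tuple of selected expert ids per token (assumed input format).
    """
    freq = [0] * num_experts
    coact = Counter()
    for experts in topk_indices:
        chosen = sorted(set(experts))          # ignore duplicate ids within a token
        for e in chosen:
            freq[e] += 1
        for a, b in combinations(chosen, 2):   # unordered expert pairs, a < b
            coact[(a, b)] += 1
    total = len(topk_indices)
    if total:
        freq = [count / total for count in freq]  # fraction of tokens per expert
    return freq, coact


# Hypothetical top-2 routing decisions for four tokens over four experts.
routes = [(0, 1), (0, 1), (0, 2), (3, 1)]
freq, coact = routing_statistics(routes, num_experts=4)
print(freq)           # [0.75, 0.75, 0.25, 0.25]
print(coact[(0, 1)])  # 2
```

A skewed frequency list points to expert specialization, while large pair counts point to experts that tend to fire together and are therefore candidates for placement on the same chiplet.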