Mozart: Modularized and Efficient MoE Training on 3.5D Wafer-Scale Chiplet Architectures

Shuqing Luo, Ye Han, Pingzhi Li, Jiayin Qin, Jie Peng, Yang Zhao, Yu Cao, Tianlong Chen

TL;DR

Mozart is a novel algorithm-hardware co-design framework for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. It exploits the inherent modularity of chiplets and introduces an expert allocation strategy that enables efficient on-package all-to-all communication.

Abstract

Mixture-of-Experts (MoE) architecture offers enhanced efficiency for Large Language Models (LLMs) with modularized computation, yet its inherent sparsity poses significant hardware deployment challenges, including memory locality issues, communication overhead, and inefficient computing resource utilization. Inspired by the modular organization of the human brain, we propose Mozart, a novel algorithm-hardware co-design framework tailored for efficient training of MoE-based LLMs on 3.5D wafer-scale chiplet architectures. On the algorithm side, Mozart exploits the inherent modularity of chiplets and introduces: (1) an expert allocation strategy that enables efficient on-package all-to-all communication, and (2) a fine-grained scheduling mechanism that improves communication-computation overlap through streaming tokens and experts. On the architecture side, Mozart adaptively co-locates heterogeneous modules on specialized chiplets with a 2.5D NoP-Tree topology and hierarchical memory structure. Evaluation across three popular MoE models demonstrates significant efficiency gains, enabling more effective parallelization and resource utilization for large-scale modularized MoE-LLMs.

Paper Structure

This paper contains 43 sections, 7 equations, 16 figures, 4 tables, and 1 algorithm.

Introduction
Related Works
Modularized LLMs.
2.5D/3.5D Chiplet for ML Workloads.
Preliminary
Mixture-of-Experts
Formulation.
Expert Parallelism Pipeline.
Analyzing Expert Activation Prior
Analyzing Workload Distribution across Individual Experts.
Analyzing Collaboration Pattern across Paired Experts.
Efficient All-to-All Communication
Methodology
Overview of Mozart
Expert Collaboration for Efficient On-Package All-to-All Communication
...and 28 more sections

Figures (16)

Figure 1: Parameter distribution in modern MoE-LLMs across various scales. The routed experts module constitutes over $90\%$ of the total parameters in these architectures.
Figure 2: Algorithm-hardware co-design diagram of $\mathtt{Mozart}$. The left part shows the algorithm-level expert clustering and allocation schemes; the right part shows the architecture-level 3.5D chiplet system. The MoE-LLM parameters are modularized in each decoder layer and mapped to individual chiplets.
Figure 3: Left: Activation frequency for pre-trained DeepSeek-MoE, indicating expert specialization. Right: Co-activation pattern for pre-trained DeepSeek-MoE, indicating expert collaboration.
Figure 4: Fine-grained scheduling pipeline in the forward pass. Streaming tokens, marked with their execution order, overlap computation (purple blocks) with DRAM communication (pink blocks, saving activations). The training pipeline involves three types of chiplets: an attention chiplet, a highly-activated chiplet, and a less-activated chiplet. Because the two MoE chiplets share the same DRAM I/O, the highly activated experts should be loaded onto the chiplet first for better communication-computation overlap.
Figure 5: The overall 3.5D chiplet architecture in $\mathtt{Mozart}$. The hardware architecture implements a three-layer hierarchical tree topology, comprising a central attention node, switch nodes, and peripheral MoE nodes. The two-tier dies are connected face-to-face.
...and 11 more figures
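
The two statistics named in the Figure 3 caption, per-expert activation frequency and pairwise co-activation, can be illustrated with a minimal sketch. This is not the paper's implementation: the function name `expert_activation_stats`, the toy routing trace, and the frequency-sorted `load_order` (a loose reading of the "highly activated experts first" remark in the Figure 4 caption) are all assumptions made for illustration.

```python
from itertools import combinations


def expert_activation_stats(routed, num_experts):
    """Compute activation frequency and co-activation counts.

    routed: list with one entry per token; each entry is the list of
            expert ids selected for that token (e.g. by top-k routing).
    Returns (freq, coact) where freq[e] is the fraction of tokens routed
    to expert e and coact[a][b] counts tokens routed to both a and b.
    """
    counts = [0] * num_experts
    coact = [[0] * num_experts for _ in range(num_experts)]
    for experts in routed:
        chosen = sorted(set(experts))  # ignore duplicate ids within a token
        for e in chosen:
            counts[e] += 1
        for a, b in combinations(chosen, 2):
            coact[a][b] += 1
            coact[b][a] += 1  # keep the matrix symmetric
    total = len(routed)
    freq = [c / total if total else 0.0 for c in counts]
    return freq, coact


# Hypothetical routing trace: 4 tokens, 4 experts, top-2 routing.
routed = [[0, 1], [0, 1], [2, 3], [0, 2]]
freq, coact = expert_activation_stats(routed, num_experts=4)

# Assumed heuristic: order experts from most to least frequently activated.
load_order = sorted(range(4), key=lambda e: -freq[e])
```

A skewed `freq` corresponds to what the caption calls expert specialization, and large off-diagonal entries in `coact` correspond to expert collaboration. How Mozart turns these statistics into an actual chiplet allocation is described in the paper's methodology and is not reproduced here.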
