Mixture of Diverse Size Experts

Manxi Sun; Wei Liu; Jian Luan; Pengzhi Gao; Bin Wang

TL;DR

MoDSE addresses a limitation of sparse MoE transformers, in which every expert has the same size, by introducing FFN experts of diverse sizes, arranged in pairs so that the total parameter budget is preserved. An expert-pair allocation strategy, together with a load-balance loss, keeps the workload evenly distributed across GPUs. Experiments on 300M×8 and 700M×8 models show that MoDSE achieves better downstream performance and faster convergence than homogeneous MoE baselines, while token routing becomes more balanced and difficult tokens are handled by appropriately sized experts. The work demonstrates that allocating the parameter budget adaptively across differently sized experts improves auto-regressive token generation without increasing total compute. The authors acknowledge resource-constrained evaluation and tokenizer openness as limitations.

Abstract

The Sparsely-Activated Mixture-of-Experts (MoE) has gained increasing popularity for scaling up large language models (LLMs) without exploding computational costs. Despite its success, the current design faces a challenge where all experts have the same size, limiting the ability of tokens to choose the experts with the most appropriate size for generating the next token. In this paper, we propose the Mixture of Diverse Size Experts (MoDSE), a new MoE architecture with layers designed to have experts of different sizes. Our analysis of difficult token generation tasks shows that experts of various sizes achieve better predictions, and the routing path of the experts tends to be stable after a training period. However, having experts of diverse sizes can lead to uneven workload distribution. To tackle this limitation, we introduce an expert-pair allocation strategy to evenly distribute the workload across multiple GPUs. Comprehensive evaluations across multiple benchmarks demonstrate the effectiveness of MoDSE, as it outperforms existing MoEs by allocating the parameter budget to experts adaptively while maintaining the same total parameter size and the number of experts.
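The pairing idea in the abstract can be illustrated with a minimal sketch. It assumes each expert is a two-matrix FFN without biases, that a pair is built by adding and subtracting the same offset from a base hidden size, and that each pair is placed on one device. The function names, the offsets, and the round-robin placement are hypothetical choices for illustration and are not taken from the paper. The sketch only balances parameter counts per device; it says nothing about how many tokens each expert receives.

```python
def make_expert_pairs(base_hidden, deltas):
    """Return (large, small) hidden sizes; every pair sums to 2 * base_hidden."""
    pairs = []
    for delta in deltas:
        if not 0 <= delta < base_hidden:
            raise ValueError("each delta must satisfy 0 <= delta < base_hidden")
        pairs.append((base_hidden + delta, base_hidden - delta))
    return pairs


def ffn_params(d_model, d_hidden):
    """Parameters of an up projection plus a down projection, no biases (assumed)."""
    return 2 * d_model * d_hidden


def allocate_pairs(pairs, num_gpus):
    """Place whole pairs on devices round-robin so each device holds equal parameters."""
    if len(pairs) % num_gpus != 0:
        raise ValueError("number of pairs must be a multiple of num_gpus")
    placement = {gpu: [] for gpu in range(num_gpus)}
    for index, pair in enumerate(pairs):
        placement[index % num_gpus].extend(pair)
    return placement


# Illustrative sizes only: eight experts in four pairs, one pair per device.
d_model, base_hidden = 1024, 4096
pairs = make_expert_pairs(base_hidden, deltas=[2048, 1024, 512, 0])
placement = allocate_pairs(pairs, num_gpus=4)
per_gpu = {
    gpu: sum(ffn_params(d_model, hidden) for hidden in sizes)
    for gpu, sizes in placement.items()
}
total_hidden = sum(hidden for sizes in placement.values() for hidden in sizes)
```

Because every pair sums to twice the base hidden size, the total matches a homogeneous layer with eight experts of size `base_hidden`, and each device ends up with the same parameter count.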

Paper Structure

This paper contains 22 sections, 13 equations, 4 figures, and 9 tables.

Introduction
Preliminaries: Mixture of Experts
MoDSE Architecture
Diverse Size Experts
Load Balance Consideration
Experiments
Experimental Setup
Models
Training configurations
Datasets
Main Results
Training convergence
Decoding efficiency
Analysis on Token Routing
Analysis on Difficult Tokens
...and 7 more sections

Figures (4)

Figure 1: Overview of a MoDSE layer with experts of different sizes. In this example, expert1_0 and expert2_0 are selected, and their outputs are integrated using the output of the gating network.
Figure 2: Training and validation loss curves for the $300M \times 8$ and $700M \times 8$ models, with cross-entropy loss values indicated on the curves.
Figure 3: The number of tokens routed to each expert. Each bar is the sum of the token counts across layers. Panels (a) and (b) show the Baseline at epoch 2 and at the last epoch; panels (c) and (d) show MoDSE at epoch 2 and at the last epoch. The purple bar indicates the most-routed expert, and the yellow bar the least-routed.
Figure 4: The top-1 expert choice of difficult tokens across eight layers. More tokens are routed to the larger experts, shown on the left half of the heat map.
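
The routing described in the Figure 1 caption, where a gating network selects two experts of different sizes and their outputs are integrated, can be sketched as follows. This is a minimal illustration under stated assumptions: the experts are two-layer ReLU FFNs, the gate is a single linear map followed by a softmax, and the two selected gate probabilities are renormalised before mixing. None of these details, nor the names `FFNExpert` and `modse_layer`, come from the paper; they are hypothetical.

```python
import math
import random


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def matvec(matrix, vec):
    """Multiply a list-of-rows matrix by a vector."""
    return [sum(w * v for w, v in zip(row, vec)) for row in matrix]


class FFNExpert:
    """Two-layer ReLU FFN; d_hidden differs from expert to expert."""

    def __init__(self, d_model, d_hidden, rng):
        self.w_in = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in range(d_hidden)]
        self.w_out = [[rng.gauss(0, 0.1) for _ in range(d_hidden)] for _ in range(d_model)]

    def __call__(self, x):
        hidden = [max(0.0, h) for h in matvec(self.w_in, x)]
        return matvec(self.w_out, hidden)


def modse_layer(x, gate_weights, experts, top_k=2):
    """Route one token vector to its top_k experts and mix their outputs."""
    probs = softmax(matvec(gate_weights, x))
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)
    out = [0.0] * len(x)
    for i in chosen:
        weight = probs[i] / norm  # renormalise over selected experts (assumption)
        out = [o + weight * y for o, y in zip(out, experts[i](x))]
    return out, chosen


rng = random.Random(0)
d_model = 8
hidden_sizes = [24, 8, 20, 12]  # two illustrative pairs, each summing to 32
experts = [FFNExpert(d_model, h, rng) for h in hidden_sizes]
gate_weights = [[rng.gauss(0, 0.1) for _ in range(d_model)] for _ in experts]
token = [rng.gauss(0, 1) for _ in range(d_model)]
output, chosen = modse_layer(token, gate_weights, experts)
```

Every expert maps back to `d_model`, so outputs from experts with different hidden sizes can be summed directly; only the per-token compute changes with the expert chosen.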
