CRAFT: Cost-aware Expert Replica Allocation with Fine-Grained Layerwise Estimations

Adrian Zhao, Zhenkun Cai, Zhenyu Song, Lingfan Yu, Haozheng Fan, Jun Wu, Yida Wang, Nandita Vijaykumar

Abstract

Mixture-of-Experts (MoE) has recently emerged as the mainstream architecture for efficiently scaling large language models while maintaining near-constant computational cost. Expert parallelism distributes parameters by partitioning experts across devices, but this introduces token-level load imbalance during inference. Expert replication is a widely adopted load-balancing technique in serving frameworks that alleviates load imbalance in large-scale deployments by replicating experts with high loads. In this work, we demonstrate that existing replication schemes often over-replicate, with many replicas providing marginal improvement. Replicas consume substantial GPU memory, which may lead to resource contention and throughput degradation. We present CRAFT, an efficient expert replication framework that maximizes load balance under a given memory budget by performing fine-grained, per-layer replication based on the estimated replication benefit. CRAFT can be seamlessly integrated into existing serving frameworks without any additional training or model changes. Our evaluation shows that CRAFT increases end-to-end serving throughput by $1.14\times$ on average (up to $1.2\times$) over existing replication techniques in large-scale deployments with models ranging from hundreds of billions to a trillion parameters.
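
To make the allocation idea in the abstract concrete: CRAFT spends a fixed memory budget on the per-layer replicas with the highest estimated replication benefit. The sketch below is only a hedged illustration of one greedy reading of that idea, not CRAFT's actual estimator or algorithm; the function name `allocate_replicas` and the load-divided-by-replicas benefit model are assumptions introduced here for exposition.

```python
import heapq

def allocate_replicas(layer_loads, budget):
    """Greedy per-layer replica allocation (illustrative sketch only).

    layer_loads: layer_loads[l][e] is the profiled token load of expert e
    in MoE layer l. budget: number of extra expert replicas the memory
    budget allows beyond the mandatory one copy per expert.
    Returns {(layer, expert): replica_count}.
    """
    replicas = {(l, e): 1
                for l, loads in enumerate(layer_loads)
                for e in range(len(loads))}

    def benefit(l, e):
        # Estimated gain of one more replica: drop in per-replica load,
        # assuming tokens split evenly across an expert's replicas.
        load, r = layer_loads[l][e], replicas[(l, e)]
        return load / r - load / (r + 1)

    # Max-heap on estimated benefit (negated for Python's min-heap).
    heap = [(-benefit(l, e), l, e) for (l, e) in replicas]
    heapq.heapify(heap)

    for _ in range(budget):
        _, l, e = heapq.heappop(heap)  # (layer, expert) with top benefit
        replicas[(l, e)] += 1
        heapq.heappush(heap, (-benefit(l, e), l, e))

    return replicas

# Example: two MoE layers with skewed loads and a budget of 3 extra replicas.
print(allocate_replicas([[90, 10, 10, 10], [40, 30, 20, 10]], budget=3))
```

A version closer to the paper's framing would presumably replace the even-split benefit term with the fine-grained layerwise estimations the title refers to, and stop when the memory budget, rather than a replica count, is exhausted.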

Paper Structure

This paper contains 28 sections, 11 figures, and 3 algorithms.

Figures (11)

  • Figure 1: Balancedness gain (orange) and throughput gain (blue) of CRAFT on Kimi-K2-1000B deployed on 64 GPUs, normalized to the baseline with only expert placement (see the KE8 configuration in Section \ref{sec:setup}). EPLB is configured with 60 replicas per GPU on Kimi-K2, which has 60 MoE layers (at minimum one replica per layer per GPU). The X-axis (expert replicas per GPU) is presented on a base-2 logarithmic scale.
  • Figure 2: Expert load distribution under different EP optimizations on an MoE layer with 8 experts. Color density represents the number of tokens assigned to the expert (expert load) during inference, and dotted experts represent replicas allocated during replication.
  • Figure 3: Breakdown of the final post-replication balancedness by each technique's contribution across various configurations. The total height of a bar represents the GPU balancedness in an MoE layer with expert replication; a technique occupying a larger portion of the bar contributes more to the overall balancedness. Layers excluded from the figures are non-MoE dense layers.
  • Figure 4: Average expert load distribution of KE8. Total load is identical on both layers; the red line marks the average load.
  • Figure 5: Load balancedness under varying per-layer replica counts across configurations. Balancedness is aggregated across all MoE layers. Each layer is allocated the same number of replicas; zero denotes the placement-only baseline. $\times$ indicates the minimum uniform replication (one replica per layer per GPU, Section \ref{sec:placerep}).
  • ...and 6 more figures