GRACE-MoE: Grouping and Replication with Locality-Aware Routing for Efficient Distributed MoE Inference

Yu Han, Lehan Pan, Jie Peng, Ziyang Tao, Wuyang Zhang, Yanyong Zhang

TL;DR

Sparse Mixture-of-Experts enables large parameter counts but requires distributed inference across GPUs, introducing cross-device communication and load imbalance bottlenecks. GRACE-MoE jointly optimizes communication and computation through offline non-uniform affinity-based grouping, dynamic replication of hot experts, and online locality-aware routing guided by load prediction, enabling efficient multi-node SMoE inference. Key contributions include a spectral-clustering-based non-uniform grouping strategy with controlled size deviation, a dynamic replication scheme based on load skew, and a topology-aware routing policy that prioritizes local replicas while balancing remote load. Across three representative MoE models and multi-node clusters, GRACE-MoE achieves up to 3.79x end-to-end speedups and strong scalability without accuracy loss, offering a practical path to scalable deployment of large SMoE models.

Abstract

Sparse Mixture of Experts (SMoE) performs conditional computation by selectively activating a subset of experts, thereby enabling scalable parameter growth in large language models (LLMs). However, the expanded parameter scale exceeds the memory capacity of a single device, necessitating distributed deployment for inference. This setup introduces two critical challenges: (1) Communication Issue: Transferring features to devices with activated experts leads to significant communication overhead. (2) Computational Load Issue: Skewed expert activation overloads certain GPUs, resulting in load imbalance across devices. Among these, communication overhead is identified as the main bottleneck in SMoE inference. Nevertheless, reducing communication between devices may exacerbate computational load imbalance, leading to device idleness and resource waste. Therefore, we present GRACE-MoE, short for Grouping and Replication with Locality-Aware Routing for SMoE inference. GRACE-MoE is a co-optimization framework that jointly reduces communication overhead and alleviates computational load imbalance. Specifically, the framework comprises two key phases: (1) Grouping & Replication: This phase groups experts based on their affinity to reduce cross-device communication. Additionally, dynamic replication is applied to address load skew, improving computational load balance across GPUs. (2) Routing: This phase employs a locality-aware routing strategy with load prediction. It prioritizes local replicas to minimize communication overhead and balances requests across remote replicas when necessary. Experiments on diverse models and multi-node, multi-GPU environments demonstrate that GRACE-MoE efficiently reduces end-to-end inference latency, achieving up to 3.79x speedup over state-of-the-art systems. Code for GRACE-MoE will be released upon acceptance.
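The routing phase described above can be illustrated with a small sketch. It is not the authors' implementation: the `Replica` record, the `pick_replica` function, and the `predicted_load` dictionary are hypothetical names, and a greedy least-predicted-load choice stands in for the weighted round-robin with load prediction that the paper uses across remote replicas. The sketch only shows the ordering of preferences: a replica on the token's own GPU first, then replicas on the same node, then any remote replica.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Replica:
    """Hypothetical record: one copy of an expert placed on a GPU of a node."""
    expert_id: int
    gpu: int
    node: int


def pick_replica(replicas, src_gpu, src_node, predicted_load):
    """Pick a replica for one token that originates on (src_gpu, src_node).

    predicted_load maps gpu id -> tokens already assigned (assumed predictor
    state); it is updated in place so later calls see the new assignment.
    """
    # 1. A replica on the token's own GPU needs no communication at all.
    local = [r for r in replicas if r.gpu == src_gpu]
    if local:
        chosen = local[0]
    else:
        # 2. Prefer replicas on the same node over cross-node transfers.
        same_node = [r for r in replicas if r.node == src_node]
        candidates = same_node or replicas
        # 3. Among the remaining candidates, take the least-loaded GPU
        #    (ties broken by GPU id to keep the choice deterministic).
        chosen = min(
            candidates,
            key=lambda r: (predicted_load.get(r.gpu, 0), r.gpu),
        )
    predicted_load[chosen.gpu] = predicted_load.get(chosen.gpu, 0) + 1
    return chosen


if __name__ == "__main__":
    load = {}
    reps = [Replica(7, 0, 0), Replica(7, 2, 1), Replica(7, 3, 1)]
    for gpu, node in [(0, 0), (1, 0), (4, 2), (4, 2)]:
        print((gpu, node), "->", pick_replica(reps, gpu, node, load).gpu)
```

In this toy setup, tokens with no local or same-node replica are spread across the two remote replicas as the predicted load on each one grows, which is the balancing behaviour the abstract attributes to the routing phase.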

Paper Structure

This paper contains 20 sections, 3 equations, 6 figures, 1 table, 4 algorithms.

Figures (6)

  • Figure 1: Grouping strictness and replication strategies. Experiments on OLMoE with 2 nodes $\times$ 2 GPUs/node, metrics reported in tokens. (a) Relaxing grouping strictness reduces communication compared to Vanilla. (b) Replicating highly activated experts alleviates load imbalance more effectively than replicating widely collaborative experts, relative to Hierarchical Grouping (HG).
  • Figure 2: Overview of GRACE-MoE. (a) Profiling expert selections to build affinity matrices. (b) Grouping high-affinity experts on the same device and dynamically replicating hot experts to balance computational load. (c) Adaptive routing reduces communication by prioritizing local replicas and balances requests via weighted round-robin with load prediction across remote replicas.
  • Figure 3: Computational load distribution after hierarchical grouping. (a) Sampled layers show that affinity clustering concentrates load on only a few groups. (b) In Layer 5, per-expert loads in the heaviest group reveal that the overload comes from a few frequently activated experts rather than from all experts in the group.
  • Figure 4: End-to-end inference latency and MoE layer time. Evaluation of GRACE-MoE and all baselines across three models with batch size = 128, prefill length = 64, and decode length = 16.
  • Figure 5: Component analysis. Grouping, replication, and routing schemes are compared across three models under a 2 nodes $\times$ 2 GPUs/node setup on the WikiText dataset. Abbreviations: Vanilla (Average Grouping), HG (Hierarchical Grouping), FR (Fixed-Count Replication), DR (Dynamic-Count Replication), WRR (Weighted Round-Robin with Load Prediction), TAR (Topology-Aware Routing with Locality Preference).
  • ...and 1 more figures
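
The profiling step in Figure 2(a) and the hot-expert observation in Figure 3(b) can be sketched as follows. This is a minimal illustration under stated assumptions, not the paper's procedure: affinity is assumed here to be the number of tokens that select two experts together, load is assumed to be the per-expert activation count, and `build_affinity` and `expert_load` are hypothetical helper names.

```python
from itertools import combinations


def build_affinity(selections, num_experts):
    """Count how often each pair of experts is selected by the same token.

    selections: one list of expert ids per token (the router's top-k choice).
    Returns a symmetric num_experts x num_experts matrix of co-activation
    counts (assumed definition of affinity).
    """
    affinity = [[0] * num_experts for _ in range(num_experts)]
    for experts in selections:
        for a, b in combinations(sorted(set(experts)), 2):
            affinity[a][b] += 1
            affinity[b][a] += 1
    return affinity


def expert_load(selections, num_experts):
    """Count activations per expert; high counts mark candidate hot experts."""
    load = [0] * num_experts
    for experts in selections:
        for e in set(experts):
            load[e] += 1
    return load


if __name__ == "__main__":
    # Three tokens, top-2 routing over three experts (toy data).
    toy = [[0, 1], [0, 1], [1, 2]]
    print(build_affinity(toy, 3))
    print(expert_load(toy, 3))
```

A grouping step would then place high-affinity pairs on the same device, and a replication step would copy the experts with the largest counts; both are left out here because the source does not give their details.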