BuddyMoE: Exploiting Expert Redundancy to Accelerate Memory-Constrained Mixture-of-Experts Inference

Yun Wang, Lingyun Yang, Senhao Yu, Yixiao Wang, Ruixing Li, Zhixiang Wei, James Yen, Zhengwei Qi

TL;DR

BuddyMoE addresses the memory bottleneck in inference for large Mixture-of-Experts models by exploiting functional redundancy among experts. It identifies functionally similar 'buddy' experts offline through co-activation analysis and, at runtime, uses a three-metric policy to decide when an expert that is missing from GPU memory can be replaced by a GPU-resident buddy, avoiding a costly PCIe transfer. In the reported setup, this yields throughput gains of up to about 10% with minimal accuracy degradation and reduces PCIe bandwidth pressure. The method makes memory-hungry MoE models more practical to deploy on memory-constrained hardware: it provides a fallback for prefetch misses and complements existing offloading and prefetching techniques.
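The two stages summarized above can be illustrated with a minimal sketch. It is not the authors' implementation: it assumes that "co-activation" means two experts being selected for the same token, it picks a single buddy per expert by raw co-activation count, and it replaces the three-metric runtime policy (which this summary does not spell out) with a plain residency check. The names `build_buddy_table`, `resolve_expert`, and the toy routing trace are hypothetical.

```python
from collections import Counter
from itertools import combinations


def build_buddy_table(routing_trace, num_experts):
    """Offline step (assumed): map each expert to its most frequent co-activation partner.

    routing_trace is a list of per-token tuples of expert IDs (hypothetical profiling data).
    """
    co_counts = Counter()
    for experts in routing_trace:
        # Count each unordered pair of distinct experts chosen for the same token.
        for a, b in combinations(sorted(set(experts)), 2):
            co_counts[(a, b)] += 1

    buddies = {}
    for e in range(num_experts):
        best, best_count = None, 0
        for other in range(num_experts):
            if other == e:
                continue
            count = co_counts[(min(e, other), max(e, other))]
            if count > best_count:  # ties keep the lower expert ID
                best, best_count = other, count
        buddies[e] = best  # None if the expert never co-activated with another
    return buddies


def resolve_expert(expert_id, gpu_resident, buddies):
    """Runtime step (simplified): return (expert_to_run, needs_pcie_transfer)."""
    if expert_id in gpu_resident:
        return expert_id, False
    buddy = buddies.get(expert_id)
    if buddy is not None and buddy in gpu_resident:
        return buddy, False  # substitute the buddy and skip the transfer
    return expert_id, True  # no usable buddy: fall back to an on-demand fetch


# Toy profiling trace with top-2 routing over 4 experts.
trace = [(0, 1), (0, 1), (2, 3), (0, 2), (2, 3), (1, 0)]
buddies = build_buddy_table(trace, num_experts=4)
print(buddies)
print(resolve_expert(1, gpu_resident={0, 2}, buddies=buddies))
```

In this toy trace, expert 1 is missing from the GPU but its buddy, expert 0, is resident, so the lookup returns the buddy and no transfer is needed. A real policy would also have to decide whether a given substitution is safe for accuracy, which is the role the three runtime metrics play in the paper.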

Abstract

Mixture-of-Experts (MoE) architectures scale language models by activating only a subset of specialized expert networks for each input token, thereby reducing the number of floating-point operations. However, the growing size of modern MoE models causes their full parameter sets to exceed GPU memory capacity; for example, Mixtral-8x7B has 45 billion parameters and requires 87 GB of memory even though only 14 billion parameters are used per token. Existing systems alleviate this limitation by offloading inactive experts to CPU memory, but transferring experts across the PCIe interconnect incurs significant latency (about 10 ms). Prefetching heuristics aim to hide this latency by predicting which experts are needed, but prefetch failures introduce significant stalls and amplify inference latency. In the event of a prefetch failure, prior work offers two primary solutions: either fetch the expert on demand, which incurs a long stall due to the PCIe bottleneck, or drop the expert from the computation, which significantly degrades model accuracy. The critical challenge, therefore, is to maintain both high inference speed and model accuracy when prefetching fails.
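To make the trade-off between the two prior fallbacks concrete, the following toy accounting model counts what each one costs on a sequence of expert requests. It is only a sketch under stated assumptions: every miss costs a flat 10 ms (the approximate PCIe transfer latency quoted above), fetched experts are not cached afterwards, and the function name `miss_cost` and the example request list are hypothetical.

```python
PCIE_TRANSFER_MS = 10.0  # approximate per-expert transfer latency from the abstract


def miss_cost(requests, gpu_resident, policy):
    """Return (stall_ms, dropped) for a list of requested expert IDs.

    policy is "fetch" (load the missing expert on demand) or
    "drop" (skip the missing expert entirely).
    """
    stall_ms, dropped = 0.0, 0
    for expert_id in requests:
        if expert_id in gpu_resident:
            continue  # hit: no penalty
        if policy == "fetch":
            stall_ms += PCIE_TRANSFER_MS  # the stall is paid on every miss
        elif policy == "drop":
            dropped += 1  # no stall, but this expert's output is lost
        else:
            raise ValueError(f"unknown policy: {policy}")
    return stall_ms, dropped


requests = [0, 5, 2, 7]
resident = {0, 2}
print(miss_cost(requests, resident, "fetch"))
print(miss_cost(requests, resident, "drop"))
```

With two misses out of four requests, fetching on demand accumulates 20 ms of stall, while dropping avoids the stall but loses two expert computations. The model deliberately ignores overlap between transfer and compute, so it describes the shape of the trade-off rather than real latencies.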

Paper Structure

This paper contains 33 sections, 6 equations, 9 figures, 4 tables, 1 algorithm.

Figures (9)

  • Figure 1: Model size is scaling substantially faster than single-accelerator memory (2017–2025). Left (log) axis: relative model size; right axis: relative device memory, illustrating the widening gap.
  • Figure 2: Architectural comparison between a standard Transformer block (a) and a MoE block (b). The MoE architecture replaces the single, dense FFN with a pool of experts, only a subset of which are activated per token.
  • Figure 3: Overview of the expert prefetching pipeline. While the GPU computes block i, the CPU uses the attention output to predict the required experts for the next block (i+1) and prefetches them. This overlaps I/O with computation to hide latency. A verification step is used to handle prediction mismatches.
  • Figure 4: Expert similarity heatmap for a 64-expert MoE model. The intensity represents the functional similarity between expert pairs, with brighter regions indicating higher similarity. The prevalent bright areas demonstrate significant redundancy across experts, suggesting opportunities for expert substitution during cache misses.
  • Figure 5: Overview of the buddy expert replacement system. The left panel shows the expert co-activation matrix derived from profiling data, where darker cells indicate higher co-activation frequency between expert pairs. The right panel illustrates the buddy replacement mechanism: (a) offline profiling identifies frequently co-activated experts with functional redundancy, and (b) runtime buddy replacement selectively substitutes GPU-resident experts with their CPU-based buddies based on activation patterns, token sensitivity, and expert distribution. The system dynamically determines replacement decisions through three key metrics (§3.1–§3.3) to minimize memory transfer overhead while maintaining model accuracy.
  • ...and 4 more figures