SlimCaching: Edge Caching of Mixture-of-Experts for Distributed Inference

Qian Chen, Xianhao Chen, Kaibin Huang

TL;DR

This work tackles latency-aware caching for distributed MoE inference at the network edge under storage constraints. It introduces SlimCaching, which caches a user’s frequently activated experts locally and distributes the rest across edge servers via Top-$K$ routing to minimize average per-token latency. For the special case $K=1$, a greedy algorithm achieves a $(1-1/e)$-approximation; for general $K\ge1$, a successive greedy decomposition with a DP-based subproblem and an accelerated max-convolution method provides a constant-approximation guarantee, supported by theoretical analysis. Experiments on SQA and VQA datasets show sizable latency reductions compared with baselines and favorable running-time characteristics, validating practicality for edge deployments.

Abstract

Mixture-of-Experts (MoE) models improve the scalability of large language models (LLMs) by activating only a small subset of relevant experts per input. However, the sheer number of expert networks in an MoE model introduces a significant storage burden for an edge device. To address this challenge, we consider a scenario where experts are dispersed across an edge network for distributed inference. Based on the popular Top-$K$ expert selection strategy, we formulate a latency minimization problem by optimizing expert caching on edge servers under storage constraints. When $K=1$, the problem reduces to a monotone submodular maximization problem with knapsack constraints, for which we design a greedy-based algorithm with a $(1 - 1/e)$-approximation guarantee. For the general case where $K \geq 1$, expert co-activation within the same MoE layer introduces non-submodularity, which renders greedy methods ineffective. To tackle this issue, we propose a successive greedy decomposition method to decompose the original problem into a series of subproblems, with each being solved by a dynamic programming approach. Furthermore, we design an accelerated algorithm based on the max-convolution technique to obtain the approximate solution with a provable guarantee in polynomial time. Simulation results on various MoE models demonstrate that our method significantly reduces inference latency compared to existing baselines.
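To make the max-convolution idea mentioned in the abstract concrete, the following is a minimal sketch of a plain (max,+) convolution, which merges two per-group value tables indexed by storage budget. It is the naive quadratic form, not the paper's accelerated algorithm; the function name `max_plus_convolution` and the toy tables `a` and `b` are assumptions introduced only for illustration.

```python
from typing import List

NEG_INF = float("-inf")


def max_plus_convolution(a: List[float], b: List[float]) -> List[float]:
    """Return c with c[k] = max over i + j == k of a[i] + b[j].

    a[i] and b[j] are read here as the best value reachable when spending
    i (resp. j) storage units on two independent groups; c[k] is then the
    best combined value when k units are split between the two groups.
    Naive O(len(a) * len(b)) version, for illustration only.
    """
    if not a or not b:
        return []
    c = [NEG_INF] * (len(a) + len(b) - 1)
    for i, ai in enumerate(a):
        for j, bj in enumerate(b):
            if ai + bj > c[i + j]:
                c[i + j] = ai + bj
    return c


# Hypothetical value tables (e.g. latency savings per storage budget).
a = [0.0, 3.0, 4.0]
b = [0.0, 2.0, 5.0]
combined = max_plus_convolution(a, b)
print(combined)  # [0.0, 3.0, 5.0, 8.0, 9.0]
```

Combining tables this way is what lets a dynamic program over several groups be built up pairwise; any speed-up over the quadratic loop depends on structure in the tables that this sketch does not assume.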

Paper Structure

This paper contains 32 sections, 8 theorems, 37 equations, 11 figures, 1 table, 3 algorithms.

Key Result

Proposition 1

For the Top-$K$ strategy, when $K_m = 1$ for all $m \in \mathcal{M}$, $\mathcal{P}1$ is a monotone non-decreasing submodular maximization problem with $N$ knapsack constraints.
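As a rough illustration of why submodularity matters here, the sketch below runs a cost-benefit greedy on a toy monotone submodular utility under a single knapsack constraint. It is a simplified assumption-laden example, not the paper's algorithm: it handles one knapsack rather than $N$, it does not by itself carry the $(1-1/e)$ guarantee, and the names `greedy_knapsack`, `covers`, `sizes`, and `coverage` are hypothetical.

```python
from typing import Callable, Dict, FrozenSet, Set


def greedy_knapsack(
    sizes: Dict[str, float],
    capacity: float,
    utility: Callable[[FrozenSet[str]], float],
) -> Set[str]:
    """Repeatedly add the feasible item with the best marginal gain per unit size."""
    chosen: Set[str] = set()
    used = 0.0
    remaining = set(sizes)
    while remaining:
        base = utility(frozenset(chosen))
        best, best_ratio = None, 0.0
        for item in sorted(remaining):  # sorted for deterministic tie-breaking
            if used + sizes[item] > capacity:
                continue  # item does not fit in the remaining storage
            gain = utility(frozenset(chosen | {item})) - base
            ratio = gain / sizes[item]
            if ratio > best_ratio:
                best, best_ratio = item, ratio
        if best is None:
            break  # nothing fits or nothing improves the utility
        chosen.add(best)
        used += sizes[best]
        remaining.discard(best)
    return chosen


# Toy data (hypothetical): each expert "serves" a set of token types.
covers = {"e1": {1, 2, 3}, "e2": {3, 4}, "e3": {5}, "e4": {1, 2, 3, 4}}
sizes = {"e1": 2.0, "e2": 1.0, "e3": 1.0, "e4": 4.0}


def coverage(cached: FrozenSet[str]) -> float:
    """Number of token types served by the cached experts (monotone submodular)."""
    return float(len(set().union(*(covers[e] for e in cached))))


cache = greedy_knapsack(sizes, capacity=4.0, utility=coverage)
print(sorted(cache), coverage(frozenset(cache)))  # ['e1', 'e2', 'e3'] 5.0
```

The coverage utility stands in for the latency-reduction objective only because both have diminishing marginal gains; the real objective depends on activation statistics and link rates that are not modeled here.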

Figures (11)

  • Figure 1: Comparison of the average per-token communication latency of the proposed SlimCaching framework and the U-shaped SI scheme across different device storage capacities in a scenario consisting of a single user, a single edge server, and the cloud. The expert activation statistics are computed from the prompts of the Visual Question Answering (VQA) v2 dataset, and “ST-b-X” denotes a Switch Transformer–based MoE model with X experts per MoE layer. The storage capacity of the edge server is set to 1.5 GB and other simulation parameters follow the settings described in Section \ref{sec:experiment}.
  • Figure 2: Visualization of activated experts in different MoE layers of the text part in MoE-LLaVA-Phi2-2.7B-4e with the Top-2 strategy \cite{lin2024moe} under the Science Question Answering (SQA) dataset \cite{lu2022learn}.
  • Figure 3: Illustration of an MoE architecture with $E$ expert networks in each MoE layer.
  • Figure 4: SlimCaching in distributed wireless systems. (a) Illustration of local cache hit, edge cache hit, and local/edge cache miss. (b) Operations of SlimCaching within an MoE layer.
  • Figure 5: Different cases of hidden-state routing when the user token’s hidden state activates the expert group $\left\{ i_m^{\left( \ell \right)},j_m^{\left( \ell \right)} \right\}$ under the Top-2 strategy.
  • ...and 6 more figures

Theorems & Definitions (22)

  • Definition 1: Submodular Function
  • Definition 2: Supermodular Function
  • Proposition 1
  • proof
  • Proposition 2
  • proof
  • Proposition 3
  • proof
  • Proposition 4
  • proof
  • ...and 12 more
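
For reference, Definitions 1 and 2 listed above correspond to the standard notions below. The paper's own statements are not reproduced on this page, so the notation ($f$, $\mathcal{V}$, $\mathcal{A}$, $\mathcal{B}$, $v$) is an assumption and may differ from the authors'. A set function $f: 2^{\mathcal{V}} \to \mathbb{R}$ is submodular if, for all $\mathcal{A} \subseteq \mathcal{B} \subseteq \mathcal{V}$ and $v \in \mathcal{V} \setminus \mathcal{B}$,

$$
f(\mathcal{A} \cup \{v\}) - f(\mathcal{A}) \;\ge\; f(\mathcal{B} \cup \{v\}) - f(\mathcal{B}),
$$

and supermodular if the reverse inequality holds, that is,

$$
f(\mathcal{A} \cup \{v\}) - f(\mathcal{A}) \;\le\; f(\mathcal{B} \cup \{v\}) - f(\mathcal{B}).
$$

Equivalently, $f$ is supermodular exactly when $-f$ is submodular. The first inequality is the diminishing-returns property: adding an element to a smaller set helps at least as much as adding it to a larger one.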