Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference

Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi

TL;DR

This work tackles the challenge of deploying Mixture of Experts (MoE) models on memory-limited devices by introducing training-free, cache-aware routing strategies that improve on-device throughput. It presents three methods (Max Rank, Cumulative Probability Threshold, and Cache-Prior) that bias expert selection toward experts already cached in DRAM while preserving accuracy. Across WikiText, MMLU, GSM8K, and on-device Android deployments, Cache-Prior delivers Pareto-dominant improvements in cache efficiency and, in many cases, accuracy, achieving up to 2$\times$ speedups with minimal performance loss. The results demonstrate robust, hardware-agnostic gains and offer a latency-accuracy trade-off that is tunable through a single parameter, enabling practical on-device MoE deployment under tight memory budgets.

Abstract

Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or "experts" for each input. However, deploying MoEs on memory-constrained devices remains challenging, particularly when generating tokens sequentially with a batch size of one, as opposed to typical high-throughput settings involving long sequences or large batches. In this work, we optimize MoE on memory-constrained devices where only a subset of expert weights fit in DRAM. We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We evaluate our approach on language modeling, MMLU, and GSM8K benchmarks and present on-device results demonstrating 2$\times$ speedups on mobile devices, offering a flexible, training-free solution to extend MoE's applicability across real-world applications.

Paper Structure

This paper contains 40 sections, 10 equations, 19 figures, 2 tables, and 2 algorithms.

Figures (19)

  • Figure 1: Overview of our proposed cache-aware routing method. (Left) MoE models are hosted in slower Flash storage due to their size, with only a subset of expert weights cached in faster DRAM. Our Cache-Prior method is cache-aware and adjusts expert selection to promote experts already in DRAM, significantly reducing cache misses and improving inference efficiency. (Right) Throughput of our routing method for the 4-bit and 8-bit quantized Qwen1.5-MoE models deployed on two mobile devices with 12 GB and 16 GB of available memory and cache sizes of 30 and 45 experts, respectively. Our proposed cache-aware routing method significantly enhances token generation throughput compared with a baseline that uses a Least Recently Used (LRU) cache. †https://en.wikipedia.org/wiki/Universal_Flash_Storage
  • Figure 2: Expert sensitivity analysis. We show the effect of dropping or replacing experts selected by the router. The x-axis shows the expert rank (experts ordered by router score) and the y-axis shows the WikiText validation perplexity (lower is better). The dashed lines mark the baseline perplexity of the MoE models.
  • Figure 3: Our proposed Cache-Prior routing method adds a bias to the logits only for in-cache experts $\mathbf{m}_t$, encouraging their selection. The magnitude of the bias is determined by the average logit range, $\Delta_\text{avg}$, and the tradeoff parameter $\lambda$. The updated logits, $\mathbf{z}_t'$, are used only for re-ranking experts, while the expert weights, $\mathbf{w}_t$, remain unchanged.
  • Figure 4: Trade-off curves between WikiText perplexity and cache miss rate for four MoE models, with the cache size set to half the total number of experts.
  • Figure 5: The trade-off between MMLU (5 shots) task accuracy and cache miss rate. For each method, points along the curve form the Pareto front, showcasing the best achievable accuracy for a given cache miss rate.
  • ...and 14 more figures
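
As a rough illustration of the Cache-Prior re-ranking described in the Figure 3 caption, the following Python sketch biases the router logits of in-cache experts and then selects the top-k experts from the biased logits, while computing the mixing weights from the original logits. The additive form $\mathbf{z}_t' = \mathbf{z}_t + \lambda \, \Delta_\text{avg} \, \mathbf{m}_t$, the softmax normalisation over the selected experts, the function name `cache_prior_route`, and all numeric values are assumptions made for illustration; they are not taken from the paper's algorithms.

```python
import math


def cache_prior_route(logits, in_cache, k, lam, delta_avg):
    """Select k experts after biasing in-cache logits (hypothetical helper)."""
    # Assumed bias form: z' = z + lam * delta_avg * m, where m is the 0/1 cache mask.
    biased = [
        z + lam * delta_avg * (1.0 if i in in_cache else 0.0)
        for i, z in enumerate(logits)
    ]
    # The biased logits are used only to re-rank the experts.
    order = sorted(range(len(logits)), key=lambda i: biased[i], reverse=True)
    selected = order[:k]
    # Mixing weights come from the unmodified logits. A softmax over the
    # selected experts is an assumption about the router's normalisation.
    peak = max(logits[i] for i in selected)
    exps = [math.exp(logits[i] - peak) for i in selected]
    total = sum(exps)
    weights = [e / total for e in exps]
    return selected, weights


# Toy example: four experts, experts 2 and 3 currently held in DRAM.
logits = [2.0, 1.5, 1.4, 0.2]
cached = {2, 3}
delta_avg = 2.0  # stand-in for the average logit range

baseline, _ = cache_prior_route(logits, cached, k=2, lam=0.0, delta_avg=delta_avg)
biased_sel, biased_w = cache_prior_route(logits, cached, k=2, lam=0.1, delta_avg=delta_avg)

baseline_misses = sum(1 for i in baseline if i not in cached)
biased_misses = sum(1 for i in biased_sel if i not in cached)
print(baseline, baseline_misses)
print(biased_sel, biased_misses)
```

In this toy setting, setting `lam` to zero recovers plain top-k routing, and a small positive `lam` swaps the second-ranked expert for a cached one whose logit is nearly as high, which lowers the number of cache misses for the token. Larger values of `lam` would promote cached experts more aggressively at the cost of drifting further from the router's original ranking.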