Mixture of Cache-Conditional Experts for Efficient Mobile Device Inference
Andrii Skliar, Ties van Rozendaal, Romain Lepert, Todor Boinovski, Mart van Baalen, Markus Nagel, Paul Whatmough, Babak Ehteshami Bejnordi
TL;DR
This work tackles the challenge of deploying Mixture of Experts (MoE) models on memory-limited devices by introducing training-free cache-aware routing strategies to improve on-device throughput. It presents three methods (Max Rank, Cumulative Probability Threshold, and Cache-Prior) that bias expert selection toward those cached in DRAM while preserving accuracy. Across WikiText, MMLU, GSM8K, and on-device Android deployments, Cache-Prior delivers Pareto-dominant improvements in cache efficiency and, in many cases, accuracy, achieving up to 2$\times$ speedups with minimal performance loss. The results demonstrate robust, hardware-agnostic gains and offer a tunable latency-accuracy trade-off through a single parameter, enabling practical on-device MoE deployment at memory-constrained scales.
Abstract
Mixture of Experts (MoE) LLMs have recently gained attention for their ability to enhance performance by selectively engaging specialized subnetworks or "experts" for each input. However, deploying MoEs on memory-constrained devices remains challenging, particularly when generating tokens sequentially with a batch size of one, as opposed to typical high-throughput settings involving long sequences or large batches. In this work, we optimize MoE inference on memory-constrained devices where only a subset of the expert weights fits in DRAM. We introduce a novel cache-aware routing strategy that leverages expert reuse during token generation to improve cache locality. We evaluate our approach on language modeling, MMLU, and GSM8K benchmarks and present on-device results demonstrating 2$\times$ speedups on mobile devices, offering a flexible, training-free solution that extends the applicability of MoEs to real-world applications.
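The abstract does not spell out the routing rule, so the following is only a minimal sketch of the general idea of cache-aware routing, not the authors' formulation. It assumes the router emits one logit per expert and that a fixed additive bonus (the hypothetical parameter `bias`) is given to experts already resident in DRAM before the top-k selection. The names `cache_biased_top_k` and `LRUExpertCache` are illustrative, and the LRU eviction policy is likewise an assumption.

```python
import math
from collections import OrderedDict


def softmax(values):
    """Numerically stable softmax over a list of floats."""
    m = max(values)
    exps = [math.exp(v - m) for v in values]
    total = sum(exps)
    return [e / total for e in exps]


def cache_biased_top_k(logits, cached, k, bias):
    """Select k experts after boosting the logits of cached experts.

    `bias` is a hypothetical knob: 0.0 reproduces plain top-k routing,
    larger values favor experts that are already in the cache.
    """
    adjusted = [x + (bias if i in cached else 0.0) for i, x in enumerate(logits)]
    chosen = sorted(range(len(logits)), key=lambda i: adjusted[i], reverse=True)[:k]
    # Mixing weights use the original (unbiased) logits of the chosen experts.
    weights = softmax([logits[i] for i in chosen])
    return chosen, weights


class LRUExpertCache:
    """Toy LRU cache that tracks which expert ids are resident (assumed policy)."""

    def __init__(self, capacity):
        self.capacity = capacity
        self.slots = OrderedDict()
        self.hits = 0
        self.misses = 0

    def __contains__(self, expert_id):
        return expert_id in self.slots

    def access(self, expert_id):
        if expert_id in self.slots:
            self.slots.move_to_end(expert_id)  # mark as most recently used
            self.hits += 1
        else:
            self.misses += 1  # would trigger a load from slower storage
            if len(self.slots) >= self.capacity:
                self.slots.popitem(last=False)  # evict least recently used
            self.slots[expert_id] = True


logits = [2.0, 1.9, 0.5, 1.8]
cached = {2, 3}
print(cache_biased_top_k(logits, cached, k=2, bias=0.0)[0])  # [0, 1]
print(cache_biased_top_k(logits, cached, k=2, bias=0.5)[0])  # [3, 0]
```

With `bias=0.0` the router ignores the cache and both selected experts would have to be loaded. With `bias=0.5` the cached expert 3 displaces expert 1, whose logit is only slightly higher, so one load is avoided at the cost of a small deviation from the original routing. A single knob of this kind is one way to trade latency against accuracy.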
