Mixture of Lookup Key-Value Experts

Zongcheng Wang

TL;DR

The paper addresses the challenge of running large-scale, sparse mixture-of-experts models on end-user devices under strict memory constraints. It introduces MoLKV, a context-aware extension of MoLE where each expert is a key–value pair and a cached KV subset from the current sequence enables interactive, context-dependent routing and gating. Empirical results show MoLKV improves validation loss compared to MoLE, while maintaining efficient batch inference by keeping most KV operations in RAM and offloading fewer parameters to storage. This work advances on-device inference by combining low-activation routing with context-aware expert outputs, potentially enabling larger, more capable models on resource-constrained hardware.

Abstract

Recent research has developed several LLM architectures suitable for inference on end-user devices, such as the Mixture of Lookup Experts (MoLE)~\parencite{jie_mixture_2025}. A key feature of MoLE is that each token id is associated with a dedicated group of experts. For a given input, only the experts corresponding to the input token id will be activated. Since the communication overhead of loading this small number of activated experts into RAM during inference is negligible, expert parameters can be offloaded to storage, making MoLE suitable for resource-constrained devices. However, MoLE's context-independent expert selection mechanism, based solely on input ids, may limit model performance. To address this, we propose the Mixture of Lookup Key-Value Experts (MoLKV) model. In MoLKV, each expert is structured as a key-value pair. For a given input, the input-derived query interacts with the cached key-value experts from the current sequence, generating a context-aware expert output. This context-aware mechanism alleviates the limitation of MoLE, and experimental results demonstrate that MoLKV achieves significantly lower validation loss in small-scale evaluations.
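
To make the contrast in the abstract concrete, the following is a minimal sketch, not the paper's formulation. It assumes one expert per token id (the abstract describes a group of experts per id), and it assumes that the query interacts with the cached keys through scaled dot-product scoring followed by a softmax; the abstract does not specify either choice. The names `mole_output`, `molkv_output`, `key_table`, and `value_table` are hypothetical.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of floats."""
    peak = max(scores)
    exps = [math.exp(s - peak) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def dot(a, b):
    """Dot product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def mole_output(token_ids, value_table):
    """MoLE-style lookup: the output at each position depends only on
    that position's token id, never on the surrounding context."""
    return [list(value_table[tid]) for tid in token_ids]


def molkv_output(token_ids, queries, key_table, value_table):
    """Assumed MoLKV-style lookup: each token id fetches a key-value
    expert, the fetched pairs are cached for the current sequence, and
    each position's query mixes the cached values."""
    cached_keys, cached_values, outputs = [], [], []
    for tid, query in zip(token_ids, queries):
        # Look up this token's expert and add it to the sequence cache.
        cached_keys.append(key_table[tid])
        cached_values.append(value_table[tid])
        # Assumption: scaled dot-product scores, softmax-normalised.
        scale = math.sqrt(len(query))
        weights = softmax([dot(query, k) / scale for k in cached_keys])
        dim = len(cached_values[0])
        outputs.append([
            sum(w * v[d] for w, v in zip(weights, cached_values))
            for d in range(dim)
        ])
    return outputs


if __name__ == "__main__":
    # Toy lookup tables: two token ids, two-dimensional keys and values.
    key_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    value_table = {0: [1.0, 0.0], 1: [0.0, 1.0]}
    token_ids = [0, 1, 0]
    queries = [[1.0, 0.0], [1.0, 0.0], [0.0, 1.0]]
    print(mole_output(token_ids, value_table))
    print(molkv_output(token_ids, queries, key_table, value_table))
```

In this toy run, token id 0 appears at the first and third positions. The MoLE-style lookup returns the same vector both times, while the MoLKV-style output differs because the third position's query also attends to the expert cached for token id 1. In both cases only the table rows for token ids seen in the sequence are read, which is the property that allows the tables to be offloaded to storage.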

Paper Structure

This paper contains 10 sections, 8 equations, 3 tables.