Mixture of Lookup Experts

Shibo Jie, Yehui Tang, Kai Han, Yitong Li, Duyu Tang, Zhi-Hong Deng, Yunhe Wang

TL;DR

The paper tackles the VRAM and latency bottlenecks of large Mixture-of-Experts models by introducing Mixture of Lookup Experts (MoLE). During training, MoLE feeds the experts with embedding-layer outputs and keeps all experts active. Before inference, it re-parameterizes each expert as a computation-free lookup table (LUT), offloads the LUTs to storage, and performs only lookups keyed by input IDs. This yields inference speeds similar to dense models while maintaining MoE-level performance and transferring far fewer parameters per token than a traditional MoE with expert offloading. The approach targets scalable, edge-friendly deployment without sacrificing accuracy, and the paper highlights LUT compression and discrete expert design as further directions.

Abstract

Mixture-of-Experts (MoE) activates only a subset of experts during inference, allowing the model to maintain low inference FLOPs and latency even as the parameter count scales up. However, since MoE dynamically selects the experts, all the experts need to be loaded into VRAM. Their large parameter size still limits deployment, and offloading, which loads experts into VRAM only when needed, significantly increases inference latency. To address this, we propose Mixture of Lookup Experts (MoLE), a new MoE architecture that is efficient in both communication and VRAM usage. In MoLE, the experts are Feed-Forward Networks (FFNs) during training, taking the output of the embedding layer as input. Before inference, these experts can be re-parameterized as lookup tables (LUTs) that retrieve expert outputs based on input ids, and offloaded to storage devices. Therefore, we do not need to perform expert computations during inference. Instead, we directly retrieve the experts' computation results based on input ids and load them into VRAM, so the resulting communication overhead is negligible. Experiments show that, with the same FLOPs and VRAM usage, MoLE achieves inference speeds comparable to dense models and significantly faster than MoE with expert offloading, while maintaining performance on par with MoE.
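
The re-parameterization described in the abstract works because each expert sees only the embedding of the current token, so its output depends on the token id alone and can be precomputed for the whole vocabulary. The following is a minimal NumPy sketch of that idea, not the authors' implementation. The toy sizes, the two-layer ReLU form of the experts, and the names `expert_ffn`, `build_luts`, `mixture_compute`, and `mixture_lookup` are all assumptions made for illustration, and the gate values are simply passed in rather than produced by a router.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy, hypothetical sizes (not taken from the paper).
vocab_size, d_model, d_ff, n_experts = 50, 8, 16, 4

# Embedding table and one two-layer ReLU FFN per expert (assumed expert form).
embedding = rng.normal(size=(vocab_size, d_model))
w_in = rng.normal(size=(n_experts, d_model, d_ff))
w_out = rng.normal(size=(n_experts, d_ff, d_model))


def expert_ffn(j, x):
    """Apply expert j to a batch of embedding vectors x of shape (T, d_model)."""
    return np.maximum(x @ w_in[j], 0.0) @ w_out[j]


def build_luts():
    """Precompute every expert's output for every vocabulary entry.

    Returns an array of shape (n_experts, vocab_size, d_model).
    """
    return np.stack([expert_ffn(j, embedding) for j in range(n_experts)])


def mixture_compute(token_ids, gates):
    """Training-style path: run all experts on the token embeddings."""
    x = embedding[token_ids]                                   # (T, d_model)
    outs = np.stack([expert_ffn(j, x) for j in range(n_experts)], axis=1)
    return np.einsum("te,ted->td", gates, outs)                # gate-weighted sum


def mixture_lookup(token_ids, gates, luts):
    """Inference-style path: no expert computation, only table lookups by id."""
    outs = luts[:, token_ids, :]                               # (E, T, d_model)
    return np.einsum("te,etd->td", gates, outs)


token_ids = np.array([3, 7, 7, 42])
logits = rng.normal(size=(len(token_ids), n_experts))
gates = np.exp(logits) / np.exp(logits).sum(axis=1, keepdims=True)  # stand-in gates

luts = build_luts()
same = np.allclose(mixture_compute(token_ids, gates),
                   mixture_lookup(token_ids, gates, luts))
```

In this sketch `same` confirms that the lookup path reproduces the computed path. The table holds `n_experts * vocab_size * d_model` values per layer, which is why it is a natural candidate for offloading to storage, while each token needs only `n_experts` rows of length `d_model` from it.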

Paper Structure

This paper contains 20 sections, 14 equations, 3 figures, 8 tables.

Figures (3)

  • Figure 1: With the same 410M activated parameters, MoE outperforms the dense model, but at the cost of significantly higher VRAM usage. If the experts are offloaded instead, inference latency increases. Our MoLE maintains competitive performance without increasing the model's VRAM usage or decoding latency.
  • Figure 2: Illustration of MoLE. During training, MoLE differs from MoE in two key structural aspects: i) The routed experts in MoLE take embedding tokens as input. ii) All experts in MoLE are activated. During inference, the routed experts in MoLE are re-parameterized as zero-computation, offloaded LUTs. For simplicity, normalization layers and residual connections of attention layers are omitted.
  • Figure 3: Decoding latency. We use expert offloading for MoE. The light-colored portion of each bar represents the delay caused by loading.
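
The loading delay in Figure 3 follows from how much data must move into VRAM per decoded token. The back-of-envelope sketch below compares the two cases under stated assumptions. Every number in it is hypothetical and chosen only for illustration, not taken from the paper. The MoE estimate is a worst case in which each selected expert is fetched on every token with no caching, and each expert is assumed to be a plain two-matrix FFN.

```python
def moe_offload_params_per_token(d_model, d_ff, top_k, n_layers):
    """Worst case for offloaded MoE: fetch top_k full experts in every layer.

    Each expert is assumed to hold two weight matrices of d_model * d_ff values.
    """
    return n_layers * top_k * 2 * d_model * d_ff


def mole_lookup_params_per_token(d_model, n_experts, n_layers):
    """MoLE-style lookup: fetch one d_model-sized row per expert in every layer."""
    return n_layers * n_experts * d_model


# Hypothetical configuration (assumed values, not the paper's settings).
d_model, d_ff, top_k, n_experts, n_layers = 1024, 4096, 2, 10, 24

moe_values = moe_offload_params_per_token(d_model, d_ff, top_k, n_layers)
mole_values = mole_lookup_params_per_token(d_model, n_experts, n_layers)
ratio = moe_values / mole_values

print(f"MoE with offloading: {moe_values:,} values per token (worst case)")
print(f"LUT lookups:         {mole_values:,} values per token")
print(f"Ratio:               {ratio:.1f}x")
```

Under these assumptions the lookup transfer scales with `d_model` alone, while the offloaded expert transfer scales with `d_model * d_ff`, so the gap widens as the experts grow.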