Fast Inference of Mixture-of-Experts Language Models with Offloading

Artyom Eliseev, Denis Mazur

TL;DR

This work tackles the difficulty of running sparse Mixture-of-Experts language models on consumer hardware by developing an MoE-specific offloading strategy combined with mixed-precision quantization. It introduces LRU-based expert caching and speculative expert loading to overlap computation with parameter transfer, enabling interactive generation on devices such as a Colab T4 and mid-range GPUs. Empirical results on Mixtral-8x7B show reduced latency and usable generation rates (2–4 tokens/s) across a range of hardware, making the model practically accessible for research and development. The approach offers a path to deploying large MoE LLMs outside of high-end infrastructure, and more accurate speculative predictions are a promising direction for further latency reductions.

Abstract

With the widespread adoption of Large Language Models (LLMs), many deep learning practitioners are looking for strategies for running these models more efficiently. One such strategy is to use sparse Mixture-of-Experts (MoE), a type of model architecture where only a fraction of model layers are active for any given input. This property allows MoE-based language models to generate tokens faster than their dense counterparts, but it also increases model size due to having multiple experts. Unfortunately, this makes state-of-the-art MoE language models difficult to run without high-end GPUs. In this work, we study the problem of running large MoE language models on consumer hardware with limited accelerator memory. We build upon parameter offloading algorithms and propose a novel strategy that accelerates offloading by taking advantage of innate properties of MoE LLMs. Using this strategy, we can run Mixtral-8x7B with mixed quantization on desktop hardware and free-tier Google Colab instances.

Paper Structure (15 sections, 2 figures, 2 tables)

This paper contains 15 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: An example of expert loading pattern in Mixtral-8x7B-Instruct for select layers. Blue cells indicate that a certain expert was active when encoding a certain token; deeper blue indicates higher gating weight. Small gray squares show which experts are cached with an LRU cache for $k{=}2$.
  • Figure 2: (left) LRU cache hit ratio for different cache sizes $k$; (right) speculative loading recall when pre-loading different numbers of experts. Solid lines represent loading 1 layer ahead; dashed lines, 2 layers ahead; dotted lines, 10 layers ahead.
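
To make the LRU cache hit ratio mentioned in the captions concrete, the following is a minimal sketch of how such a ratio could be measured for a single MoE layer. It is an illustration under stated assumptions, not the authors' implementation: the names `LRUExpertCache` and `lru_hit_ratio` are hypothetical, the trace of active experts is made up, and the sketch tracks only expert ids rather than moving any weights between host and GPU memory.

```python
from collections import OrderedDict
from typing import Iterable


class LRUExpertCache:
    """Tracks the k most recently used expert ids for one MoE layer.

    Hypothetical helper for illustration: it stores ids only and does
    not load or evict actual expert parameters.
    """

    def __init__(self, k: int) -> None:
        if k < 1:
            raise ValueError("cache size k must be at least 1")
        self.k = k
        # Keys are ordered from least to most recently used.
        self._slots: "OrderedDict[int, None]" = OrderedDict()

    def access(self, expert_id: int) -> bool:
        """Record one use of expert_id and return True on a cache hit."""
        hit = expert_id in self._slots
        if hit:
            # Mark the expert as most recently used.
            self._slots.move_to_end(expert_id)
        else:
            if len(self._slots) >= self.k:
                # Evict the least recently used expert.
                self._slots.popitem(last=False)
            self._slots[expert_id] = None
        return hit


def lru_hit_ratio(trace: Iterable[Iterable[int]], k: int) -> float:
    """Fraction of expert activations already cached, for cache size k.

    `trace` holds, for each generated token, the ids of the experts the
    router selected in this layer (assumed input format).
    """
    cache = LRUExpertCache(k)
    hits = 0
    total = 0
    for active_experts in trace:
        for expert_id in active_experts:
            hits += cache.access(expert_id)
            total += 1
    return hits / total if total else 0.0


if __name__ == "__main__":
    # Made-up routing trace: two active experts per token.
    example_trace = [(0, 1), (1, 2), (0, 2), (3, 2)]
    for k in (2, 4):
        print(k, lru_hit_ratio(example_trace, k))
```

One simplification to keep in mind: experts are accessed one at a time, so with a small `k` the first expert loaded for a token can evict another expert that the same token needs. A real offloading system would likely handle a token's experts as a group.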