QMoE: Practical Sub-1-Bit Compression of Trillion-Parameter Models

Elias Frantar, Dan Alistarh

TL;DR

The paper tackles the prohibitive memory costs of trillion-parameter Mixture-of-Experts models by introducing QMoE, a sub-1-bit compression framework that co-designs a dictionary-based encoding with GPU-optimized decoding kernels. It demonstrates that the 1.6T-parameter SwitchTransformer-c2048 can be compressed to less than 160GB (0.8 bits per parameter) with minor accuracy loss and can be executed end-to-end on commodity hardware with under 5% runtime overhead relative to ideal uncompressed inference. Key innovations include a scalable data-dependent quantization pipeline, activation-offloading strategies, expert-grouped GPTQ, and a bespoke GPU kernel and encoding scheme that enable fast on-the-fly decompression. This work broadens practical deployment and research access to trillion-parameter MoEs, and the code and compressed models are released as open source to facilitate adoption and further study.

Abstract

Mixture-of-Experts (MoE) architectures offer a general solution to the high inference costs of large language models (LLMs) via sparse routing, bringing faster and more accurate models, at the cost of massive parameter counts. For example, the SwitchTransformer-c2048 model has 1.6 trillion parameters, requiring 3.2TB of accelerator memory to run efficiently, which makes practical deployment challenging and expensive. In this paper, we present a solution to this memory problem, in the form of a new compression and execution framework called QMoE. Specifically, QMoE consists of a scalable algorithm which accurately compresses trillion-parameter MoEs to less than 1 bit per parameter, in a custom format co-designed with bespoke GPU decoding kernels to facilitate efficient end-to-end compressed inference, with minor runtime overheads relative to uncompressed execution. Concretely, QMoE can compress the 1.6 trillion parameter SwitchTransformer-c2048 model to less than 160GB (20x compression, 0.8 bits per parameter) at only minor accuracy loss, in less than a day on a single GPU. This enables, for the first time, the execution of a trillion-parameter model on affordable commodity hardware, like a single server with 4x NVIDIA A6000 or 8x NVIDIA 3090 GPUs, at less than 5% runtime overhead relative to ideal uncompressed inference. The source code and compressed models are available at github.com/IST-DASLab/qmoe.
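
The memory figures quoted in the abstract follow from simple arithmetic, which the sketch below reproduces as a sanity check. It is not taken from the paper's code. The 16-bit baseline width and the per-GPU memory sizes (48 GB for an A6000, 24 GB for a 3090) are assumptions based on the bfloat16 format and vendor specifications, and the helper name `size_bytes` is hypothetical.

```python
# Back-of-the-envelope check of the memory figures quoted in the abstract.
# Assumptions (not stated in the source): the uncompressed baseline stores
# 16 bits per weight, and the GPUs have 48 GB (A6000) and 24 GB (3090).

PARAMS = 1.6e12      # parameter count quoted in the abstract
BASELINE_BITS = 16   # assumed bfloat16 storage
QMOE_BITS = 0.8      # bits per parameter quoted in the abstract


def size_bytes(num_params: float, bits_per_param: float) -> float:
    """Storage needed for num_params weights at the given bit width."""
    return num_params * bits_per_param / 8


uncompressed_tb = size_bytes(PARAMS, BASELINE_BITS) / 1e12  # terabytes
compressed_gb = size_bytes(PARAMS, QMOE_BITS) / 1e9         # gigabytes
compression_ratio = BASELINE_BITS / QMOE_BITS

# Aggregate memory of the two server configurations named in the abstract.
a6000_server_gb = 4 * 48
rtx3090_server_gb = 8 * 24

print(f"uncompressed: {uncompressed_tb:.1f} TB")
print(f"compressed:   {compressed_gb:.0f} GB ({compression_ratio:.0f}x smaller)")
print(f"4x A6000: {a6000_server_gb} GB, 8x 3090: {rtx3090_server_gb} GB")
```

Under these assumptions both server configurations provide 192 GB in total, which leaves some headroom above the roughly 160 GB compressed model. The abstract reports "less than 160GB", so the true average bit width is presumably slightly below the rounded 0.8 used here.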

Paper Structure

This paper contains 56 sections, 2 equations, 5 figures, 8 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Example of an MoE Transformer block. Each token is routed to a different fully-connected (FC) block.
  • Figure 2: Illustration of the offloading execution for the sparse part of a Transformer block. An expert $E_2$ and its corresponding input tokens $X_E$ are fetched to GPU memory to produce $E'_2$, which, together with the corresponding outputs $Y_E$, is written back to CPU memory.
  • Figure 3: List buffer example with 3 samples, indicated by hue.
  • Figure 4: Data format of a dictionary entry; here of 24 weights.
  • Figure 5: (Left) Per-layer compressed kernel performance relative to uncompressed execution. (Right) End-to-end runtimes of compressed models and estimates ($^*$, would require 65/130 GPUs) for bfloat16 baselines. c2048 is run on 4$\times$A6000 and 8$\times$3090 GPUs, respectively.