QuantMoE-Bench: Examining Post-Training Quantization for Mixture-of-Experts

Pingzhi Li, Xiaolong Jin, Zhen Tan, Yu Cheng, Tianlong Chen

TL;DR

MoE models carry substantial memory overhead because of their large parameter counts, and existing post-training quantization (PTQ) methods apply a uniform bit-width that ignores the sparse, structure-driven activation patterns of MoE. QuantMoE-Bench benchmarks post-training quantization of MoE models across two representative MoE LLMs and six tasks, and finds that attention layers, shared experts, and early MoE blocks need higher precision to preserve performance. The authors propose structure-aware bit allocation and introduce two data-driven techniques, the outlier-aware linear layer scorer and the MoE block importance predictor. Fine-grained mixed-precision quantization reaches a state-of-the-art average task performance of $65.35\%$, compared with $64.30\%$ for the GPTQ baseline. Overall, the work offers a benchmark and actionable strategies for memory-efficient deployment of MoE-based LLMs with minimal performance loss.

Abstract

Mixture-of-Experts (MoE) is a promising way to scale up the learning capacity of large language models. It increases the number of parameters while keeping FLOPs nearly constant during inference through sparse activation. Yet, it still suffers from significant memory overhead due to the vast parameter size, necessitating model compression techniques. Post-training quantization offers a powerful approach for model compression. Existing methods adopt a fixed quantization precision for the entire MoE model. This rigid setup can lead to suboptimal performance because it does not consider the inherent sparse structure. For example, MoE's sparse routing mechanism leads to different activation patterns, where shared experts are accessed by all tokens while token-conditioned experts are selectively activated. This activation disparity suggests different quantization requirements, with consistently activated shared experts potentially needing higher precision to maintain model quality. In this paper, we study a fine-grained precision setup for MoE quantization. We explore MoE structure-aware quantization heuristics, ranging from coarse (e.g., MoE layers) to fine granularity (e.g., linear layers). Our investigations reveal critical principles, where different MoE structures require varying numbers of bits for effective quantization. Conclusions are supported by extensive benchmarking across two representative MoE models and six tasks, including commonsense reasoning and natural language understanding. We further show that an MoE quantized in fine-grained mixed precision achieves a state-of-the-art 65.35% average performance, compared to 64.30% for the baseline (i.e., GPTQ). Moreover, based on these findings, we introduce novel data-driven techniques for optimizing bit allocation in MoE quantization, including the outlier-aware linear layer scorer and the MoE block importance predictor.
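
As a rough illustration of the structure-aware idea described in the abstract, the sketch below assigns a bit-width to each structural group and computes the parameter-weighted average bit-width of the whole model. The `Component` class, the grouping, the parameter counts, and the bit assignments are all hypothetical values chosen for illustration; they are not taken from the paper or from any real model.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Component:
    """One structural group of weights in an MoE model (hypothetical grouping)."""
    name: str
    num_params: int  # illustrative count, not taken from any real model
    bits: int        # bit-width assigned to this group


def average_bits(components):
    """Parameter-weighted average bit-width over all components."""
    total_params = sum(c.num_params for c in components)
    total_bits = sum(c.num_params * c.bits for c in components)
    return total_bits / total_params


# Hypothetical model in which routed experts dominate the parameter count.
mixed = [
    Component("attention", 1_000, 4),
    Component("shared_experts", 1_000, 4),
    Component("routed_experts", 8_000, 2),
]
# Uniform baseline: every group gets the same bit-width.
uniform = [Component(c.name, c.num_params, 3) for c in mixed]

mixed_avg = average_bits(mixed)      # (1000*4 + 1000*4 + 8000*2) / 10000 = 2.4
uniform_avg = average_bits(uniform)  # 3.0
```

Under these made-up numbers, spending extra bits on the small, always-active groups still yields a lower average bit-width than the uniform 3-bit setting, because the routed experts hold most of the parameters.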

Paper Structure (22 sections, 6 equations, 5 figures, 9 tables, 2 algorithms)

Figures (5)

  • Figure 1: After our post-training quantization, the pre-trained MoE (a) is quantized into (b), with mixed precision assigned according to model structure, demonstrated with Mixtral-8x7B.
  • Figure 2: Visualization of expert usage for the two MoE models used in this work, profiled on our quantization calibration data, i.e., $512$ random sequences of $4096$ tokens each from the WikiText dataset (Merity et al., 2016).
  • Figure 3: Comparison of allocating more bits (i.e., $4$ bits) to attention and frequently activated experts against uniform-bit quantization (i.e., $3$ and $8$ bits). The Pareto-optimal solution is $3.29$ bits.
  • Figure 4: Comparison of allocating more bits to attention vs. FFNN and to shared experts vs. others, evaluated on the Mixtral-8x7B model. The FFNN and others results show the mean and standard deviation (error bars) over $3$ independent trials.
  • Figure 5: (a) Visualization of the outlier-aware linear layer scorer metric applied to each FFNN linear weight matrix within the Mixtral-8x7B model. For clearer visualization, we present the components separately: the gate projection (left), up projection (middle), and down projection (right) in FFNN experts. (b) Visualization of the MoE block importance score predictor metric applied to the DeepSeek-MoE-16B-base model.
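
The expert-usage profiling mentioned in the Figure 2 caption can be sketched as follows. This is a minimal illustration that assumes usage is measured as the fraction of calibration tokens whose top-k routing selects each expert; the caption does not spell out the exact statistic, and the function names and toy router logits below are hypothetical.

```python
from collections import Counter


def top_k_experts(router_logits, k):
    """Indices of the k largest router logits for one token."""
    ranked = sorted(range(len(router_logits)),
                    key=lambda i: router_logits[i], reverse=True)
    return ranked[:k]


def expert_usage(all_logits, k):
    """Fraction of tokens that route to each expert under top-k routing."""
    num_experts = len(all_logits[0])
    counts = Counter()
    for logits in all_logits:
        counts.update(top_k_experts(logits, k))
    return [counts[e] / len(all_logits) for e in range(num_experts)]


# Toy router logits for 4 tokens over 4 experts (made-up values).
toy_logits = [
    [2.0, 1.0, 0.1, -1.0],
    [1.5, 0.2, 1.0, -0.5],
    [0.3, 2.2, 1.9, 0.0],
    [3.0, 0.5, 0.4, 0.1],
]
usage = expert_usage(toy_logits, k=2)
```

With these toy values, experts 0 and 1 are each selected by three of the four tokens, expert 2 by two, and expert 3 by none. A usage profile of this kind is the sort of signal that could inform which experts receive more bits.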