Mixture of Quantized Experts (MoQE): Complementary Effect of Low-bit Quantization and Robustness

Young Jin Kim, Raffy Fahim, Hany Hassan Awadalla

TL;DR

The paper addresses the memory and latency bottlenecks of large Mixture-of-Experts models by quantizing only the expert weights to ultra-low-bit representations. It demonstrates that MoE expert FFN weights are unusually robust to quantization, allowing 2-bit quantization with QAT and effective 3–4-bit quantization without retraining, while maintaining or improving multilingual MT performance. Empirical results show substantial memory reductions (up to ~79.6%) and ~1.24x GPU speedups, with MoE configurations sometimes surpassing dense baselines in BLEU scores. These findings enable more efficient deployment of MoE models, though fully optimized ultra-low-bit implementations and 2-bit QAT workflows remain as open opportunities.

Abstract

Large Mixture of Experts (MoE) models can achieve state-of-the-art quality on various language tasks, including machine translation, thanks to efficient model scaling with expert parallelism. However, they bring a fundamental issue of larger memory consumption and an increased memory bandwidth bottleneck at deployment time. In this paper, we propose Mixture of Quantized Experts (MoQE), a simple weight-only quantization method that applies ultra low-bit quantization, down to 2 bits, only to expert weights in order to mitigate the increased memory and latency issues of MoE models. We show that low-bit quantization together with the MoE architecture delivers reliable model performance while reducing the memory size significantly, even without any additional training in most cases. In particular, expert layers in MoE models are much more robust to quantization than conventional feedforward network (FFN) layers. In our comprehensive analysis, we show that MoE models with 2-bit expert weights can deliver better model performance than the dense model trained on the same dataset. As a result of low-bit quantization, we show that the model size can be reduced by 79.6% relative to the original half-precision floating point (fp16) MoE model. Combined with an optimized GPU runtime implementation, it also achieves a 1.24X speed-up on A100 GPUs.
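
To make "weight-only quantization" of an expert matrix more concrete, the following is a minimal sketch of symmetric linear quantization with either one scaling factor per output channel (channel-wise) or one per matrix (matrix-wise). It is an illustration under stated assumptions, not the paper's implementation: the function names (`quantize_rows`, `dequantize_rows`, `row_error`), the symmetric integer range, the max-abs scale, and the toy weights are all hypothetical choices made for this example.

```python
def quantize_rows(weights, bits, per_channel=True):
    """Symmetric linear quantization of a 2-D weight matrix (list of rows).

    Assumption for this sketch: each row is one output channel, and the
    scale is chosen so that the largest magnitude maps to the top code.
    Returns (integer codes, one scale per row).
    """
    qmax = 2 ** (bits - 1) - 1      # e.g. 1 for 2-bit, 7 for 4-bit
    qmin = -(2 ** (bits - 1))       # e.g. -2 for 2-bit, -8 for 4-bit
    if per_channel:
        # Channel-wise: every row gets its own scaling factor.
        max_abs = [max(abs(w) for w in row) for row in weights]
    else:
        # Matrix-wise: a single scaling factor shared by all rows.
        shared = max(abs(w) for row in weights for w in row)
        max_abs = [shared] * len(weights)
    scales = [m / qmax if m > 0 else 1.0 for m in max_abs]
    # Python's round() rounds exact halves to the nearest even integer.
    codes = [[min(qmax, max(qmin, round(w / s))) for w in row]
             for row, s in zip(weights, scales)]
    return codes, scales


def dequantize_rows(codes, scales):
    """Map integer codes back to approximate floating-point weights."""
    return [[q * s for q in row] for row, s in zip(codes, scales)]


def row_error(original, restored, row):
    """Largest absolute reconstruction error within one row."""
    return max(abs(x - y) for x, y in zip(original[row], restored[row]))


# Toy matrix (made-up values): two channels with very different ranges.
W = [[0.50, -0.25, 0.10, -0.50],
     [0.02, -0.01, 0.005, -0.02]]

codes_c, scales_c = quantize_rows(W, bits=4, per_channel=True)
codes_m, scales_m = quantize_rows(W, bits=4, per_channel=False)

err_c = row_error(W, dequantize_rows(codes_c, scales_c), row=1)
err_m = row_error(W, dequantize_rows(codes_m, scales_m), row=1)

codes_2bit, _ = quantize_rows(W, bits=2, per_channel=True)
```

In this toy case the small-range second row is rounded entirely to zero under the shared matrix-wise scale, while the channel-wise scale preserves its structure. Only the integer codes and the scales need to be stored; activations and all non-expert weights are left untouched in this sketch.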

Paper Structure (18 sections, 6 figures, 7 tables)

Figures (6)

  • Figure 1: FFN weight distribution across layers. Even-numbered layers $\{0, 2, ...\}$ are expert FFN layers and odd-numbered layers $\{1, 3, ...\}$ are normal dense FFN layers. (a) shows the first linear layer in the FFN and (b) shows the second linear layer in the FFN.
  • Figure 2: Quantization impact on different MoE model parts (channel-wise linear quantization without any additional training).
  • Figure 3: Quantization performance comparison between MoE and dense models. 10 different language pair scores are averaged.
  • Figure 4: Linear quantization vs log-scale with optimal ${\bm{s}}$ quantization.
  • Figure 5: Linear quantization of expert FFNs with channel-wise and matrix-wise scaling factors.
  • ...and 1 more figures