KBVQ-MoE: KLT-guided SVD with Bias-Corrected Vector Quantization for MoE Large Language Models

Zukang Xu, Zhixiong Zhao, Xing Hu, Zhixuan Chen, Dawei Yang

TL;DR

The paper tackles the challenge of deploying MoE large language models under tight resource constraints by targeting ultra-low-bit quantization. It introduces KBVQ-MoE, a framework that combines input-driven redundancy elimination (IDRE) with bias-corrected output stabilization (BCOS) to enable effective vector quantization of MoE weights. Across multiple MoE LLMs, including Qwen and Mixtral, the method achieves near-FP16 accuracy at 2–3 bits and delivers substantial memory and speed benefits, supporting its practical use in edge and other resource-constrained deployments. The work shows that exploiting MoE structure and input statistics can substantially improve compression performance without altering the underlying MoE topology.

Abstract

Mixture of Experts (MoE) models have achieved great success by significantly improving performance while maintaining computational efficiency through sparse expert activation. However, their enormous parameter sizes and memory demands pose major challenges for deployment in resource-constrained environments. Vector Quantization (VQ) offers a promising approach for ultra-low-bit compression in Large Language Models (LLMs) by leveraging a codebook, where weight vectors are mapped to the most similar discrete codewords. Yet, directly applying VQ to MoEs often leads to substantial performance degradation due to two critical obstacles: (1) redundant representations among experts cause VQ to repeatedly quantize similar representations for each expert, resulting in inefficient use of limited codebook capacity; and (2) cumulative output bias is amplified by expert aggregation in MoE layers, leading to distributional shifts in the quantized outputs. To address these issues, we propose KBVQ-MoE, a novel VQ framework to enhance extremely low-bit quantization for MoE-based LLMs. KBVQ-MoE integrates two techniques: (1) input-driven redundancy elimination, where a Karhunen-Loeve Transform (KLT) guided singular value decomposition (SVD) extracts dominant weight components and shares them across experts; and (2) bias-corrected output stabilization, where vector quantization is applied only to expert-specific (non-redundant) representations and the quantized outputs are corrected via channel-wise affine compensation. Experiments on various MoE LLMs demonstrate that KBVQ-MoE preserves accuracy substantially better than existing quantization methods. For example, 3-bit quantization of Qwen1.5-MoE-A2.7B achieves an average accuracy of 67.99, nearly identical to the FP16 baseline of 68.07, underscoring KBVQ-MoE's potential for efficient deployment on edge devices and other resource-constrained platforms.
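To make the codebook idea in the abstract concrete, the following is a minimal sketch of plain vector quantization: each weight vector is replaced by the index of its nearest codeword. It is not the KBVQ-MoE pipeline. The function names, the toy codebook, and the use of squared Euclidean distance as the similarity measure are all assumptions made for illustration.

```python
import math

def nearest_codeword(vec, codebook):
    """Return the index of the codeword closest to vec (squared Euclidean distance)."""
    best_idx, best_dist = 0, math.inf
    for idx, code in enumerate(codebook):
        dist = sum((v - c) ** 2 for v, c in zip(vec, code))
        if dist < best_dist:
            best_idx, best_dist = idx, dist
    return best_idx

def quantize(weights, codebook, dim):
    """Split a flat weight list into dim-sized vectors and keep one index per vector."""
    assert len(weights) % dim == 0, "weight count must be a multiple of dim"
    vectors = [weights[i:i + dim] for i in range(0, len(weights), dim)]
    return [nearest_codeword(v, codebook) for v in vectors]

def dequantize(indices, codebook):
    """Rebuild approximate weights by looking up each stored index."""
    restored = []
    for idx in indices:
        restored.extend(codebook[idx])
    return restored

def bits_per_weight(codebook_size, dim):
    """Index cost per weight, ignoring the storage of the codebook itself."""
    return math.log2(codebook_size) / dim

# Toy data (hypothetical): four 2-dimensional codewords and eight weights.
codebook = [[0.0, 0.0], [1.0, 1.0], [-1.0, 1.0], [1.0, -1.0]]
weights = [0.9, 1.2, -0.8, 0.7, 0.1, -0.2, 1.1, -0.9]

indices = quantize(weights, codebook, dim=2)
restored = dequantize(indices, codebook)
print(indices)                  # one small integer per 2-weight vector
print(bits_per_weight(256, 4))  # 256 codewords over 4-dim vectors
```

Under these assumptions, a codebook of 256 entries over 4-dimensional vectors costs 8 index bits per 4 weights, or 2 bits per weight, which is the kind of budget the abstract refers to as ultra-low-bit. Because the codebook is small, spending its entries on near-identical vectors from different experts wastes capacity, which is the first obstacle the abstract names.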

Paper Structure

This paper contains 44 sections, 23 equations, 5 figures, 17 tables, and 1 algorithm.

Figures (5)

  • Figure 1: Average accuracy across multiple MoE architectures, showing that KBVQ-MoE achieves superior performance under 2-bit quantization.
  • Figure 2: Similarity of expert outputs before and after redundancy elimination by KBVQ-MoE.
  • Figure 3: Distributional shifts in the layer-20 outputs of Qwen3-30B-A3B: (top) per-channel mean and (bottom) per-channel variance, each compared across FP, direct VQ, and KBVQ-MoE.
  • Figure 4: Comparison of the MoE (Mixture-of-Experts) structure before and after IDRE.
  • Figure 5: The low-rank characteristics after expert fusion activation of the first block for (a) Qwen1.5-MoE-A2.7B and (b) Qwen3-30B-A3B.
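
Figure 3 compares per-channel means and variances, and the abstract describes correcting quantized outputs with a channel-wise affine compensation. The sketch below shows one simple way such a correction could be fit, by matching each channel's mean and standard deviation to the full-precision reference. The moment-matching rule, the function names, and the toy data are assumptions for illustration. The source does not say how the paper fits its affine parameters.

```python
from statistics import fmean, pstdev

def channel_stats(outputs):
    """Per-channel mean and population std; outputs holds one row per token."""
    columns = list(zip(*outputs))
    return [fmean(c) for c in columns], [pstdev(c) for c in columns]

def fit_affine(ref_outputs, quant_outputs, eps=1e-8):
    """Fit a per-channel scale and shift so quantized statistics match the reference.

    Assumption: simple moment matching, not necessarily the paper's procedure.
    """
    ref_mean, ref_std = channel_stats(ref_outputs)
    q_mean, q_std = channel_stats(quant_outputs)
    scales = [rs / max(qs, eps) for rs, qs in zip(ref_std, q_std)]
    shifts = [rm - s * qm for rm, s, qm in zip(ref_mean, scales, q_mean)]
    return scales, shifts

def apply_affine(outputs, scales, shifts):
    """Apply y = scale * y_quant + shift independently to every channel."""
    return [[s * y + b for y, s, b in zip(row, scales, shifts)] for row in outputs]

# Toy data (hypothetical): 3 tokens, 2 channels.
ref = [[1.0, 10.0], [3.0, 14.0], [5.0, 18.0]]
# Simulated quantized outputs: channel 0 is shrunk and shifted, channel 1 is biased.
quant = [[1.5, 8.0], [2.5, 12.0], [3.5, 16.0]]

scales, shifts = fit_affine(ref, quant)
corrected = apply_affine(quant, scales, shifts)

print(channel_stats(quant))      # shifted statistics before correction
print(channel_stats(corrected))  # statistics after correction
```

In this toy case the distortion is itself affine, so the correction recovers the reference statistics up to floating-point error. With real quantization error, only the first two moments of each channel would be aligned.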