EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization

Zhongqian Fu, Tianyi Zhao, Ning Ding, Xianzhi Yu, Xiaosong Li, Yehui Tang, Yunhe Wang

TL;DR

EAQuant tackles post-training quantization (PTQ) for Mixture-of-Experts (MoE) models by addressing three problems: activation outliers, routing sensitivity, and calibration sparsity. It pairs each problem with a matching strategy: Expert-Aware Smoothing Aggregation, Expert-Aware Routing Consistency Alignment, and Expert-Aware Calibration Data Balance. The approach unifies per-expert activation smoothing, stabilizes routing decisions under low-bit quantization, and ensures balanced calibration coverage for rarely activated experts, while leaving inference-time computation unchanged. Empirical results across three MoE architectures show state-of-the-art robustness under ultra-low-bit settings (e.g., W4A4, W3A4), with substantial gains on reasoning tasks and perplexity close to full-precision baselines, and strong resilience at more extreme settings (W3A3, W2A4). The work provides a practical, scalable path for deploying high-capacity MoE models on resource-constrained devices, supported by open-source code.

Abstract

Mixture-of-Experts (MoE) models enable scalable computation and performance in large-scale deep learning but face quantization challenges due to sparse expert activation and dynamic routing. Existing post-training quantization (PTQ) methods fail to address activation outliers, routing instability, and sparse expert calibration, leading to significant performance degradation. To address this, we propose EAQuant, a PTQ framework tailored for MoE architectures. Our method introduces three expert-aware innovations: (1) smoothing aggregation to suppress activation outliers, (2) routing consistency alignment to preserve expert selection post-quantization, and (3) calibration data balance to optimize sparsely activated experts. These strategies collectively enable robust, high-precision quantization of MoE models under ultra-low-bit constraints. Extensive experiments across several extreme quantization settings (e.g., W4A4/W3A4/W3A3/W2A4) demonstrate that EAQuant significantly outperforms existing methods, achieving average accuracy improvements of 1.15% to 13.81% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at https://github.com/darren-fzq1/EAQuant.
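
For readers unfamiliar with the notation, a setting such as W4A4 denotes 4-bit weights and 4-bit activations. The sketch below is a generic illustration and not EAQuant's implementation. It shows two of the effects the abstract names: an activation outlier stretching the quantization range until small values collapse to zero, and a simple way to measure whether top-k expert selection is preserved after quantization. The function names (fake_quantize, top_k_experts, routing_agreement) and the choice of symmetric per-tensor quantization are assumptions made for this example.

```python
def fake_quantize(values, bits):
    """Symmetric per-tensor uniform quantization followed by dequantization.

    Illustrative assumption: one scale for the whole list, signed integer grid.
    """
    qmax = 2 ** (bits - 1) - 1              # e.g. 7 for 4-bit signed integers
    max_abs = max(abs(v) for v in values)
    if max_abs == 0:
        return [0.0 for _ in values]
    scale = max_abs / qmax                  # one outlier inflates this scale
    quantized = []
    for v in values:
        q = round(v / scale)                # nearest integer level
        q = max(-qmax - 1, min(qmax, q))    # clamp to the representable range
        quantized.append(q * scale)         # map back to floating point
    return quantized


def top_k_experts(router_logits, k):
    """Indices of the k experts with the largest router logits."""
    order = sorted(range(len(router_logits)),
                   key=lambda i: router_logits[i], reverse=True)
    return set(order[:k])


def routing_agreement(fp_logits, quant_logits, k):
    """Fraction of the full-precision top-k experts still selected after quantization."""
    fp_choice = top_k_experts(fp_logits, k)
    quant_choice = top_k_experts(quant_logits, k)
    return len(fp_choice & quant_choice) / k


if __name__ == "__main__":
    # One outlier channel (8.0) dominates the range of a 4-bit activation tensor.
    activations = [0.1, 0.12, -0.05, 8.0]
    print(fake_quantize(activations, bits=4))

    # Hypothetical router logits before and after some quantization error.
    print(routing_agreement([2.0, 1.9, 0.3, -1.0], [1.8, 2.1, 0.4, -0.9], k=2))
```

In this toy example the three small activations all round to the zero level because the outlier sets the scale, which is the kind of loss that activation smoothing is meant to reduce. The agreement score is one possible diagnostic for routing stability; the paper's actual alignment objective is not specified in this section.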

Paper Structure

This paper contains 26 sections, 12 equations, 7 figures, 12 tables.

Figures (7)

  • Figure 1: Activation outliers across channels in different experts and layers of the OLMoE model, exhibiting distribution patterns similar to those of dense models and posing challenges for quantization. Activation visualizations for other models are provided in the Appendix.
  • Figure 2: Empirical observations motivating expert-aware MoE quantization. More results for other models are provided in the Appendix.
  • Figure 3: The overview of our proposed EAQuant method with three key components. 1) Expert-Aware Smoothing Aggregation. 2) Expert-Aware Routing Consistency Alignment. 3) Expert-Aware Calibration Data Balance.
  • Figure 4: Extended visualization for Figure 1: Activation outliers across channels in different experts and layers of three MoE models, exhibiting similar distribution patterns to dense models and posing challenges for quantization.
  • Figure 5: Extended visualization for Figure 2(a): channel-wise concentration of dominant activations across all experts in MoE models.
  • ...and 2 more figures