EAQuant: Enhancing Post-Training Quantization for MoE Models via Expert-Aware Optimization
Zhongqian Fu, Tianyi Zhao, Ning Ding, Xianzhi Yu, Xiaosong Li, Yehui Tang, Yunhe Wang
TL;DR
EAQuant tackles post-training quantization for Mixture-of-Experts models by addressing activation outliers, routing sensitivity, and calibration sparsity with three aligned strategies: Expert-Aware Smoothing Aggregation, Expert-Aware Routing Consistency Alignment, and Expert-Aware Calibration Data Balance. The approach unifies per-expert activation smoothing, stabilizes routing decisions under low-bit quantization, and ensures balanced calibration coverage for rarely activated experts, while leaving inference-time computation unchanged. Empirical results across three MoE architectures show state-of-the-art robustness under ultra-low-bit settings (e.g., W4A4, W3A4), with substantial gains on reasoning tasks, perplexity close to full-precision baselines, and strong resilience at extreme quantization (W3A3, W2A4). The work provides a practical, scalable path for deploying high-capacity MoE models on resource-constrained devices, supported by open-source code.
Abstract
Mixture-of-Experts (MoE) models enable scalable computation and performance in large-scale deep learning but face quantization challenges due to sparse expert activation and dynamic routing. Existing post-training quantization (PTQ) methods fail to address activation outliers, routing instability, and sparse expert calibration, leading to significant performance degradation. To address this, we propose EAQuant, a PTQ framework tailored for MoE architectures. Our method introduces three expert-aware innovations: (1) smoothing aggregation to suppress activation outliers, (2) routing consistency alignment to preserve expert selection post-quantization, and (3) calibration data balance to optimize sparsely activated experts. These strategies collectively enable robust, high-precision quantization of MoE models under ultra-low-bit constraints. Extensive experiments across several extreme quantization settings (e.g., W4A4/W3A4/W3A3/W2A4) demonstrate that EAQuant significantly outperforms existing methods, achieving average accuracy improvements of 1.15% to 13.81% across three diverse MoE architectures, with particularly pronounced gains in reasoning tasks and robust performance retention under aggressive quantization. By integrating these innovations, EAQuant establishes a new state-of-the-art for high-precision, efficient MoE model compression. Our code is available at https://github.com/darren-fzq1/EAQuant.
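To make the WxAy notation and the routing-sensitivity concern concrete, the following is a minimal sketch and not the EAQuant implementation. It assumes symmetric round-to-nearest quantization, per-token activation scales, per-output-channel weight scales, and a toy linear router with top-2 selection. The names `fake_quantize`, `topk_experts`, and `routing_agreement`, as well as the tensor shapes and the injected outlier channel, are hypothetical choices made for illustration.

```python
import numpy as np

def fake_quantize(x, n_bits, axis=None):
    """Symmetric uniform quantize-dequantize to n_bits (round to nearest).

    axis=None uses one scale for the whole tensor; otherwise one scale
    is computed per slice along `axis`.
    """
    qmax = 2 ** (n_bits - 1) - 1
    max_abs = np.max(np.abs(x), axis=axis, keepdims=True)
    scale = np.where(max_abs > 0, max_abs / qmax, 1.0)
    q = np.clip(np.round(x / scale), -qmax - 1, qmax)
    return q * scale

def topk_experts(logits, k):
    """Indices of the k largest router logits per token, sorted so that
    two selections can be compared as sets."""
    idx = np.argsort(-logits, axis=-1)[:, :k]
    return np.sort(idx, axis=-1)

def routing_agreement(x, w_router, w_bits, a_bits, k=2):
    """Fraction of tokens whose top-k expert set is unchanged when the
    router input and router weights are quantized (a toy WxAy setting)."""
    fp_logits = x @ w_router
    x_q = fake_quantize(x, a_bits, axis=-1)        # per-token activation scale
    w_q = fake_quantize(w_router, w_bits, axis=0)  # per-output-channel weight scale
    q_logits = x_q @ w_q
    same = np.all(topk_experts(fp_logits, k) == topk_experts(q_logits, k), axis=-1)
    return float(same.mean())

# Toy data: 256 tokens, hidden size 64, 8 experts (all values are assumptions).
rng = np.random.default_rng(0)
x = rng.normal(size=(256, 64))
x[:, 3] *= 30.0  # inject one outlier channel to mimic activation outliers
w_router = rng.normal(size=(64, 8))

for w_bits, a_bits in [(8, 8), (4, 4), (3, 3)]:
    rate = routing_agreement(x, w_router, w_bits, a_bits)
    print(f"W{w_bits}A{a_bits}: top-2 routing agreement = {rate:.3f}")
```

In this toy setting, the outlier channel inflates each token's activation scale, so the remaining channels are represented coarsely at low bit-widths and some tokens can end up routed to different experts. This is the kind of interaction between activation outliers and expert selection that the abstract's smoothing and routing-consistency components are described as targeting.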
