Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference
Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zecheng Liu, Wei Zhang
TL;DR
MoE models have enormous memory footprints because all experts must reside in GPU memory, which hampers deployment on consumer GPUs. The paper introduces DynaExq, a runtime system that treats expert precision as a dynamic resource. It combines a hotness-aware precision controller, an asynchronous expert-swapping pipeline, and a fragmentation-free memory pool to operate within per-layer $M_{HBM}$ budgets. The controller tracks EMA-based activation statistics $S_i^{(t)}$ and compares them against a threshold $\tau_h$ to promote or demote experts, so precision transitions are non-blocking and inference continues uninterrupted. Across Qwen3-30B and Qwen3-80B on RTX 5090 and A6000 GPUs, DynaExq improves accuracy by up to 4.03 points over static low-bit baselines while maintaining competitive latency, showing that adaptive, workload-aware quantization is effective for memory-constrained MoE serving on a single GPU.
Abstract
Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. We present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys these models on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
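To make the hotness-aware controller concrete, the following is a minimal sketch of how EMA-based activation scores and a hotness threshold could drive promotion and demotion decisions. It is an illustration under stated assumptions, not the paper's implementation: the class name `HotnessController`, the smoothing factor `alpha`, the normalization of counts to routing fractions, and the `max_hot` cap (a crude stand-in for the per-layer HBM budget) are all hypothetical.

```python
from dataclasses import dataclass, field


@dataclass
class HotnessController:
    """Toy per-layer precision controller (hypothetical, for illustration)."""
    num_experts: int
    alpha: float = 0.5    # assumed EMA smoothing factor
    tau_h: float = 0.15   # hotness threshold on the EMA score
    max_hot: int = 2      # assumed cap standing in for the HBM budget
    scores: list = field(default_factory=list)
    high_precision: set = field(default_factory=set)

    def __post_init__(self):
        # One EMA score per expert, starting cold.
        self.scores = [0.0] * self.num_experts

    def update(self, counts):
        """Fold one window of per-expert token counts into the EMA scores
        and return (promote, demote) lists of expert indices."""
        total = sum(counts) or 1
        for i, c in enumerate(counts):
            frac = c / total  # share of tokens routed to expert i
            self.scores[i] = self.alpha * frac + (1 - self.alpha) * self.scores[i]

        # Hottest experts first; keep at most max_hot that clear the threshold.
        ranked = sorted(range(self.num_experts),
                        key=lambda i: self.scores[i], reverse=True)
        target = {i for i in ranked[:self.max_hot] if self.scores[i] >= self.tau_h}

        promote = sorted(target - self.high_precision)
        demote = sorted(self.high_precision - target)
        self.high_precision = target
        return promote, demote


if __name__ == "__main__":
    ctrl = HotnessController(num_experts=4)
    for window in ([8, 2, 0, 0], [0, 0, 10, 0], [0, 0, 10, 0]):
        print(ctrl.update(window))
```

In this sketch the returned lists are only decisions. In a system like the one described, the actual weight transfers would be handed to an asynchronous pipeline so that they overlap with MoE computation rather than blocking it.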
