Dynamic Expert Quantization for Scalable Mixture-of-Experts Inference

Kexin Chu, Dawei Xiang, Zixu Shen, Yiwei Yang, Zecheng Liu, Wei Zhang

TL;DR

MoE models have large memory footprints because all experts must reside in GPU memory, which hampers deployment on consumer GPUs. The paper introduces DynaExq, a runtime system that treats expert precision as a dynamic resource. It combines a hotness-aware precision controller, an asynchronous expert-swapping pipeline, and a fragmentation-free memory pool to operate within per-layer budgets $M_{HBM}$. It uses EMA-based activation statistics $S_i^{(t)}$ and a threshold $\tau_h$ to decide which experts to promote or demote, and the transitions are non-blocking, so inference continues while precision changes. Across Qwen3-30B and Qwen3-80B on RTX 5090 and A6000 GPUs, DynaExq achieves accuracy gains of up to 4.03 percentage points over static low-bit baselines while maintaining competitive latency, indicating that adaptive, workload-aware quantization is effective for memory-constrained MoE serving on single GPUs.
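The EMA-and-threshold decision described above can be sketched as follows. This is a minimal illustration, not the paper's implementation: the class name `HotnessController`, the smoothing factor `alpha`, the default value of `tau_h`, and the `max_high` cap (a stand-in for the per-layer budget $M_{HBM}$) are all assumptions introduced here.

```python
from dataclasses import dataclass, field


@dataclass
class HotnessController:
    """Toy EMA-based precision controller (hypothetical, for illustration only)."""
    num_experts: int
    alpha: float = 0.1    # EMA smoothing factor (assumed)
    tau_h: float = 0.05   # hotness threshold (assumed value)
    max_high: int = 8     # high-precision slots per layer, standing in for the budget
    scores: list = field(default_factory=list)

    def __post_init__(self):
        # One EMA score per expert, starting cold.
        self.scores = [0.0] * self.num_experts

    def update(self, counts):
        """Blend this step's routing frequencies into the EMA scores."""
        total = sum(counts) or 1
        for i, c in enumerate(counts):
            freq = c / total
            self.scores[i] = (1 - self.alpha) * self.scores[i] + self.alpha * freq

    def plan(self, high_precision):
        """Return (promote, demote) sets relative to the current high-precision set."""
        ranked = sorted(range(self.num_experts),
                        key=lambda i: self.scores[i], reverse=True)
        # Keep only experts above the threshold, capped by the slot budget.
        hot = {i for i in ranked[: self.max_high] if self.scores[i] >= self.tau_h}
        return hot - high_precision, high_precision - hot
```

For example, with four experts and routing counts `[8, 2, 0, 0]`, a controller built with `alpha=0.5`, `tau_h=0.2`, and `max_high=1` would promote expert 0 and demote a previously high-precision expert 1. Because the score is an exponential moving average, a short burst of activations does not immediately trigger a swap.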

Abstract

Mixture-of-Experts (MoE) models scale LLM capacity efficiently, but deployment on consumer GPUs is limited by the large memory footprint of inactive experts. Static post-training quantization reduces storage costs but cannot adapt to shifting activation patterns, causing accuracy loss under aggressive compression. We present DynaExq, a runtime system that treats expert precision as a first-class, dynamically managed resource. DynaExq combines (1) a hotness-aware precision controller that continuously aligns expert bit-widths with long-term activation statistics, (2) a fully asynchronous precision-switching pipeline that overlaps promotion and demotion with MoE computation, and (3) a fragmentation-free memory pooling mechanism that supports hybrid-precision experts with deterministic allocation. Together, these components enable stable, non-blocking precision transitions under strict HBM budgets. Across Qwen3-30B and Qwen3-80B MoE models and six representative benchmarks, DynaExq deploys large LLMs on single RTX 5090 and A6000 GPUs and improves accuracy by up to 4.03 points over static low-precision baselines. The results show that adaptive, workload-aware quantization is an effective strategy for memory-constrained MoE serving.
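One common way to obtain fragmentation-free, deterministic allocation is a pool of equal-sized slots, one pool per precision class. The sketch below illustrates that general idea only; the abstract does not specify DynaExq's actual allocator, and the name `SlotPool` and its methods are hypothetical.

```python
class SlotPool:
    """Fixed-size slot pool for one precision class (hypothetical illustration)."""

    def __init__(self, slot_bytes, num_slots):
        self.slot_bytes = slot_bytes
        # Stack of free slot indices; pop() hands out the lowest index first.
        self.free = list(range(num_slots - 1, -1, -1))
        self.owner = {}  # expert_id -> slot index

    def alloc(self, expert_id):
        """Return a byte offset for the expert, or None if the pool is full."""
        if not self.free:
            return None
        slot = self.free.pop()
        self.owner[expert_id] = slot
        return slot * self.slot_bytes

    def release(self, expert_id):
        """Return the expert's slot to the free stack so it can be reused."""
        slot = self.owner.pop(expert_id)
        self.free.append(slot)
```

Because every slot in a pool has the same size, a freed slot always fits the next expert of that precision, so no external fragmentation accumulates and allocation is a constant-time stack operation. A promotion would then amount to allocating from the high-precision pool and, once the copy completes, releasing the expert's slot in the low-precision pool.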

Paper Structure

This paper contains 21 sections, 3 equations, 5 figures, 1 table, 1 algorithm.

Figures (5)

  • Figure 1: Expert activation distributions across layers and workloads. Each subplot shows routing frequency for 128 experts in layer 1 and layer 47.
  • Figure 2: System architecture of DynaExq.
  • Figure 3: An example of the asynchronous expert-swapping pipeline.
  • Figure 4: Perplexity Impact of Varying Low-Precision Expert Ratios Across Layers in DynaExq (Qwen3-30B-A3B: FP16 vs. Int4, Qwen3-80B-A3B: Int4 vs. Int2)
  • Figure 5: Comparison of latency and throughput across models and methods.