A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU
Yuchen Luo, Fangyue Zhu, Ruining Zhou, Mingzhe Huang, Jian Zhu, Fanyu Fan, Wei Shao
TL;DR
Problem: deploying reasoning-oriented LLMs on Ascend NPUs poses different PTQ challenges than deployment on GPUs. Approach: an empirical evaluation of four representative PTQ baselines (AWQ, GPTQ, SmoothQuant, FlatQuant) on the DeepSeek-R1-Distill-Qwen series and QwQ-32B, covering both simulated and real INT8 paths. Findings: 4-bit weight-only quantization is viable only for the larger models; 8-bit PTQ is numerically stable; 4-bit weight-activation-KV (W-A-KV4) quantization is highly platform-sensitive; real INT8 kernels reduce latency, but end-to-end speedups are limited by dynamic quantization overhead. Significance: the results inform practical PTQ deployment on Ascend NPUs and motivate NPU-aware quantization research.
Abstract
Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPUs remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models such as the DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four algorithms, namely AWQ, GPTQ, SmoothQuant, and FlatQuant, which span the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, leading to logic collapse in long-context reasoning tasks. Conversely, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment demonstrates that although optimized kernels reduce latency, dynamic quantization overheads currently limit end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPUs.
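To make the "dynamic quantization overhead" mentioned above concrete, the following is a minimal NumPy sketch of an INT8 linear layer with offline per-channel weight quantization and runtime per-token activation quantization. It is an illustration under stated assumptions, not the kernels or the pipeline evaluated in this paper: the function names, the symmetric quantization scheme, and the tensor shapes are all hypothetical choices made for the example.

```python
import numpy as np

def quantize_weights_per_channel(w, n_bits=8):
    """Offline step: symmetric per-output-channel quantization of a (in, out) matrix."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(w).max(axis=0, keepdims=True) / qmax   # one scale per output channel
    scale = np.where(scale == 0, 1.0, scale)              # avoid division by zero
    w_q = np.clip(np.round(w / scale), -qmax, qmax).astype(np.int8)
    return w_q, scale

def quantize_activations_per_token(x, n_bits=8):
    """Runtime step: symmetric per-token quantization of a (tokens, in) matrix."""
    qmax = 2 ** (n_bits - 1) - 1
    # The abs-max reduction, division, rounding and clipping below run on every
    # forward pass; this is the extra work that "dynamic" quantization adds.
    scale = np.abs(x).max(axis=1, keepdims=True) / qmax   # one scale per token
    scale = np.where(scale == 0, 1.0, scale)
    x_q = np.clip(np.round(x / scale), -qmax, qmax).astype(np.int8)
    return x_q, scale

def int8_linear(x, w_q, w_scale):
    """Quantize activations on the fly, multiply in integers, then dequantize."""
    x_q, x_scale = quantize_activations_per_token(x)
    acc = x_q.astype(np.int32) @ w_q.astype(np.int32)     # integer accumulation
    return acc.astype(np.float32) * x_scale * w_scale     # rescale to floating point

# Hypothetical shapes: 4 tokens, hidden size 64, 32 output features.
rng = np.random.default_rng(0)
x = rng.standard_normal((4, 64)).astype(np.float32)
w = rng.standard_normal((64, 32)).astype(np.float32)

w_q, w_scale = quantize_weights_per_channel(w)            # done once, before serving
y_ref = x @ w                                             # floating-point reference
y_int8 = int8_linear(x, w_q, w_scale)
rel_err = np.linalg.norm(y_int8 - y_ref) / np.linalg.norm(y_ref)
print(f"relative error of the INT8 path: {rel_err:.4f}")
```

In this sketch the weight scales are computed once, whereas the activation scales must be recomputed for every batch of tokens. On real hardware, those per-token reductions and the final rescaling sit outside the integer matrix multiply, which is one plausible reason a faster INT8 kernel does not translate fully into end-to-end speedup.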
