A Case Study of Selected PTQ Baselines for Reasoning LLMs on Ascend NPU

Yuchen Luo; Fangyue Zhu; Ruining Zhou; Mingzhe Huang; Jian Zhu; Fanyu Fan; Wei Shao

TL;DR

Problem: Ascend NPUs pose different challenges for PTQ than GPUs when deploying reasoning-oriented LLMs. Approach: an empirical evaluation of four representative PTQ baselines (AWQ, GPTQ, SmoothQuant, and FlatQuant) on the DeepSeek-R1-Distill-Qwen series and QwQ-32B, with both simulated and real INT8 paths. Findings: 4-bit weight-only quantization is viable only for the larger models; 8-bit PTQ is numerically stable; 4-bit weight-activation-KV quantization (W4A4KV4) is highly platform-sensitive; real INT8 kernels reduce latency, but end-to-end speedups are limited by dynamic quantization overhead. Significance: the results inform practical PTQ deployment on Ascend NPUs and motivate NPU-aware quantization research.

Abstract

Post-Training Quantization (PTQ) is crucial for efficient model deployment, yet its effectiveness on Ascend NPUs remains under-explored compared to GPU architectures. This paper presents a case study of representative PTQ baselines applied to reasoning-oriented models: the DeepSeek-R1-Distill-Qwen series (1.5B/7B/14B) and QwQ-32B. We evaluate four algorithms (AWQ, GPTQ, SmoothQuant, and FlatQuant) that span the spectrum from weight-only compression to advanced rotation-based methods. Our empirical results reveal significant platform sensitivity. While 4-bit weight-only quantization proves viable for larger models, aggressive 4-bit weight-activation schemes suffer from layer-wise calibration instability on the NPU, which leads to logic collapse in long-context reasoning tasks. In contrast, standard 8-bit quantization remains numerically stable. Furthermore, a real-world INT8 deployment shows that although optimized kernels reduce latency, dynamic quantization overhead currently limits end-to-end acceleration. These findings offer a practical reference for the feasibility and limitations of deploying quantized reasoning models on Ascend NPUs.

Paper Structure (12 sections, 6 equations, 1 figure, 3 tables)

Introduction
Quantization Frameworks and Empirical Paradigms
Evaluation of Quantized Reasoning Models on ATLAS A2
Experimental Setup
Key Observations
Comparison of FlatQuant-W4A4KV4 Between NPU and X2000
Mitigating Calibration Fragility of W4A4KV4 Rotations on Ascend NPUs
Real-World Deployment and Acceleration on Ascend NPU
Discussion
Platform sensitivity of low-bit quantization.
Why does 4-bit W-A-KV quantization fail for long-context reasoning on Ascend?
Conclusion

Figures (1)

Figure 1: Comparison of layer-wise MSE loss for FlatQuant (W4A4KV4). From left to right: Llama3-8B and DeepSeek-R1-Distill-Qwen-7B on X2000, followed by the same models on NPU. The results illustrate a significant increase in quantization error when transitioning from X2000 to NPU hardware.
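The layer-wise MSE metric in Figure 1 can be illustrated with a minimal sketch. It assumes plain symmetric round-to-nearest fake quantization on a single random linear layer; it is not FlatQuant and does not reproduce the paper's measurements. The helper names `fake_quantize`, `linear`, and `layer_mse` are hypothetical and chosen only for this illustration.

```python
import random


def fake_quantize(values, bits):
    """Symmetric round-to-nearest quantize-then-dequantize (illustrative only)."""
    qmax = 2 ** (bits - 1) - 1  # 7 for 4-bit, 127 for 8-bit
    max_abs = max(abs(v) for v in values)
    if max_abs == 0.0:
        return list(values)  # nothing to scale
    scale = max_abs / qmax
    out = []
    for v in values:
        q = round(v / scale)            # map to the integer grid
        q = max(-qmax, min(qmax, q))    # clamp to the representable range
        out.append(q * scale)           # map back to floating point
    return out


def linear(weights, x):
    """Dense layer without bias: one dot product per output row."""
    return [sum(w * xi for w, xi in zip(row, x)) for row in weights]


def layer_mse(weights, x, bits):
    """MSE between the full-precision output and the fake-quantized output."""
    ref = linear(weights, x)
    w_q = [fake_quantize(row, bits) for row in weights]  # one scale per output row
    x_q = fake_quantize(x, bits)                          # one scale for the activation vector
    out = linear(w_q, x_q)
    return sum((a - b) ** 2 for a, b in zip(ref, out)) / len(ref)


# Synthetic data (assumption): Gaussian weights and activations, fixed seed.
rng = random.Random(0)
weights = [[rng.gauss(0, 1) for _ in range(64)] for _ in range(16)]
x = [rng.gauss(0, 1) for _ in range(64)]

mse4 = layer_mse(weights, x, 4)
mse8 = layer_mse(weights, x, 8)
print(f"W4A4 layer MSE: {mse4:.6f}")
print(f"W8A8 layer MSE: {mse8:.6f}")
```

Repeating this measurement per layer, with real calibration data and the actual quantizer, would give a curve of the kind Figure 1 plots. The sketch only shows why the 4-bit grid leaves far less numerical headroom than the 8-bit one.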
