Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning
Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, Hongxia Yang
TL;DR
This work examines how low-bit quantization, used to deploy large language models under practical efficiency constraints, affects mathematical reasoning. It quantizes Llama-3 models to 4-bit weights with 16-bit activations using AWQ and GPTQ and evaluates them on the MATH benchmark, revealing significant degradation in conceptual and numerical reasoning. To address this, the authors develop a multidimensional error-analysis framework, which diagnoses errors with 98.9% accuracy across 3,366 failure cases, and a lightweight restoration pipeline based on LoRA/QLoRA adapters and Direct Preference Optimization, which recovers reasoning ability using only 545 targeted examples trained in about 3 minutes on 4 GPUs. The findings offer actionable guidance for balancing efficiency and reasoning fidelity in quantized LLMs and highlight practical strategies for rapid capability restoration in math-heavy tasks.
Abstract
Large language models have made significant advances on complex mathematical reasoning benchmarks such as MATH. However, their substantial computational requirements present challenges for practical deployment. Model quantization has emerged as an effective strategy for reducing memory usage and computational cost by using lower-precision, lower-bit-width representations. In this study, we systematically evaluate the impact of quantization on mathematical reasoning tasks. Our results demonstrate that aggressive quantization methods such as AWQ and GPTQ introduce up to 32.39% accuracy degradation (11.31% on average) on Llama-3 models, particularly in numerical computation and reasoning planning. To address this, we introduce a multidimensional evaluation framework that combines qualitative capability analysis with quantitative error assessment. We further develop targeted recovery strategies, showing that fine-tuning quantized models on only 545 task-specific examples for 3 minutes on 4 GPUs effectively restores reasoning capabilities to near full-precision levels. Additionally, our error assessment pipeline achieves 98.9% accuracy in diagnosing and localizing errors across 3,366 failure cases, providing actionable insights for mitigating quantization-induced degradation.
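To make the idea of a lower bit-width representation concrete, the following is a minimal sketch of symmetric round-to-nearest 4-bit weight quantization for a single group of weights. It is an illustrative baseline only and is not AWQ or GPTQ, both of which add calibration-based steps on top of this basic rounding; the function names, the group contents, and the signed 4-bit range are assumptions chosen for the example.

```python
def quantize_group(weights, bits=4):
    """Symmetric round-to-nearest quantization of one weight group (illustrative)."""
    qmax = 2 ** (bits - 1) - 1                      # 7 for signed 4-bit codes
    max_abs = max(abs(w) for w in weights)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # one shared scale per group
    # Round each weight to the nearest integer code and clamp to [-8, 7].
    codes = [max(-qmax - 1, min(qmax, round(w / scale))) for w in weights]
    return codes, scale


def dequantize_group(codes, scale):
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale for c in codes]


# Hypothetical weight values, chosen only to show the rounding error.
weights = [0.70, -0.33, 0.12, 0.04]
codes, scale = quantize_group(weights)
restored = dequantize_group(codes, scale)
max_error = max(abs(w - r) for w, r in zip(weights, restored))
print(codes, scale, max_error)
```

With a single shared scale, every weight in the group is stored as a 4-bit integer, and the per-weight rounding error is bounded by half the scale. Small weights lose proportionally more precision than large ones, which gives some intuition for why precise numerical computation could be sensitive to this kind of compression.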
