Quantization Meets Reasoning: Exploring LLM Low-Bit Quantization Degradation for Mathematical Reasoning

Zhen Li, Yupeng Su, Runming Yang, Congkai Xie, Zheng Wang, Zhongwei Xie, Ngai Wong, Hongxia Yang

TL;DR

This work examines how low-bit quantization affects large language models deployed for mathematical reasoning under practical efficiency constraints. It quantizes Llama-3 models to 4-bit weights with 16-bit activations (W4A16) using AWQ and GPTQ and evaluates them on the MATH benchmark, revealing significant degradation in conceptual and numerical reasoning. To address this, the authors develop a multidimensional error-analysis framework and a lightweight restoration pipeline using LoRA/QLoRA adapters and Direct Preference Optimization. Fine-tuning on only 545 targeted examples for about 3 minutes on 4 GPUs restores reasoning to near full-precision levels, and the error-assessment pipeline reaches a diagnostic accuracy of $98.9\%$ across $3{,}366$ failure cases. The findings offer actionable guidance for balancing efficiency and reasoning fidelity in quantized LLMs and highlight practical strategies for rapid capability restoration in math-heavy tasks.

Abstract

Large language models have achieved significant advancements in complex mathematical reasoning benchmarks, such as MATH. However, their substantial computational requirements present challenges for practical deployment. Model quantization has emerged as an effective strategy to reduce memory usage and computational costs by employing lower precision and bit-width representations. In this study, we systematically evaluate the impact of quantization on mathematical reasoning tasks. Our results demonstrate that aggressive quantization methods like AWQ and GPTQ introduce up to 32.39% accuracy degradation (average 11.31%) on Llama-3 models, particularly in numerical computation and reasoning planning. To address this, we introduce a multidimensional evaluation framework combining qualitative capability analysis and quantitative error assessment. We further develop targeted recovery strategies, showing that fine-tuning quantized models on only 545 task-specific examples for 3 minutes on 4 GPUs effectively restores reasoning capabilities to near full-precision levels. Additionally, our error assessment pipeline achieves 98.9% accuracy in diagnosing and localizing errors across 3,366 failure cases, providing actionable insights for mitigating quantization-induced degradation.
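
To make the "lower precision and bit-width representations" idea concrete, the following is a minimal sketch of weight-only 4-bit quantization using plain group-wise round-to-nearest. It is not AWQ or GPTQ (both add calibration-based steps on top of this basic scheme), and the function names, group size, and asymmetric min/max scaling are illustrative assumptions rather than the paper's configuration.

```python
from typing import List, Tuple


def quantize_group(weights: List[float], bits: int = 4) -> Tuple[List[int], float, float]:
    """Asymmetric round-to-nearest quantization of one weight group.

    Returns integer codes in [0, 2**bits - 1] plus the scale and offset
    needed to map the codes back to floating point.
    """
    levels = (1 << bits) - 1                      # 15 for 4-bit codes
    w_min, w_max = min(weights), max(weights)
    # A constant group has no range; any positive scale works.
    scale = (w_max - w_min) / levels if w_max > w_min else 1.0
    codes = [min(levels, max(0, round((w - w_min) / scale))) for w in weights]
    return codes, scale, w_min


def dequantize_group(codes: List[int], scale: float, w_min: float) -> List[float]:
    """Map integer codes back to approximate floating-point weights."""
    return [c * scale + w_min for c in codes]


def quantize_dequantize(weights: List[float], group_size: int = 4, bits: int = 4) -> List[float]:
    """Quantize then dequantize in groups, exposing the rounding error."""
    restored: List[float] = []
    for start in range(0, len(weights), group_size):
        group = weights[start:start + group_size]
        codes, scale, w_min = quantize_group(group, bits)
        restored.extend(dequantize_group(codes, scale, w_min))
    return restored


if __name__ == "__main__":
    w = [-0.5, 0.1, 0.25, 1.0, 0.3, 0.3, 0.3, 0.3]
    for original, approx in zip(w, quantize_dequantize(w)):
        print(f"{original:+.4f} -> {approx:+.4f}")
```

Each weight lands within half a quantization step of its original value. Whether and how such small per-weight errors accumulate into the reasoning failures reported above is what the paper's evaluation sets out to measure.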

Paper Structure (24 sections, 9 figures, 4 tables, 1 algorithm)

Figures (9)

  • Figure 1: Pipeline of our study for evaluating and restoring reasoning capabilities in quantized models. Using format-alignment training and our error assessment pipeline with expert judge models, we diagnose and analyze ability-level reasoning weaknesses in the model's step-by-step solutions. Based on the identified error types, we sample a targeted ‘medicine’ dataset to fine-tune the model via Direct Preference Optimization (DPO), aiming to restore performance while preserving efficiency.
  • Figure 2: An example of the data used during training. The output is concatenated using added tokens, and the final answer is placed inside \boxed{} in the evaluation format.
  • Figure 3: Distribution of error types in quantized models, highlighting the dominant error types affecting mathematical reasoning.
  • Figure 4: Radar plot of error distributions across Llama models of different scales on the MATH benchmark.
  • Figure 5: Comparative analysis of error types for quantized models (AWQ-W4A16 and GPTQ-W4A16) across three model scales.
  • ...and 4 more figures
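
The Figure 1 caption mentions fine-tuning via Direct Preference Optimization. As a rough illustration, the sketch below computes the standard per-pair DPO loss from sequence log-probabilities. The function name, argument names, and the default `beta = 0.1` are assumptions for illustration; the paper's actual hyperparameters and training code are not given in this section.

```python
import math


def dpo_loss(policy_chosen_logp: float,
             policy_rejected_logp: float,
             ref_chosen_logp: float,
             ref_rejected_logp: float,
             beta: float = 0.1) -> float:
    """Standard DPO loss for one (chosen, rejected) pair.

    Inputs are summed log-probabilities of each full response under the
    policy being trained and under a frozen reference model.
    """
    # How much more the policy favors each response than the reference does.
    chosen_gain = policy_chosen_logp - ref_chosen_logp
    rejected_gain = policy_rejected_logp - ref_rejected_logp
    margin = beta * (chosen_gain - rejected_gain)
    # -log(sigmoid(margin)), written in a numerically stable form.
    if margin >= 0:
        return math.log1p(math.exp(-margin))
    return -margin + math.log1p(math.exp(margin))


if __name__ == "__main__":
    # Policy identical to the reference: loss equals log(2).
    print(dpo_loss(-12.0, -15.0, -12.0, -15.0))
    # Policy now prefers the chosen (correct) solution more strongly.
    print(dpo_loss(-10.0, -18.0, -12.0, -15.0))
```

In the setting described here, the chosen response would be a correct step-by-step solution and the rejected one a faulty solution of the diagnosed error type, though the exact pairing procedure is an assumption.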