APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models

Ziyi Guan, Hantao Huang, Yupeng Su, Hong Huang, Ngai Wong, Hao Yu

TL;DR

APTQ tackles the challenge of edge deployment for large language models by introducing an attention-aware post-training mixed-precision quantization framework that leverages second-order Hessian information and the nonlinear dynamics of attention. By applying a Levenberg–Marquardt Hessian approximation to the attention block and using Hessian-trace based sensitivity to allocate 2-bit and 4-bit precision, it achieves near full-precision perplexity with an average of 4 bits and state-of-the-art zero-shot accuracy at around 3.8 bits on LLaMA-7B and LLaMA-13B. The method outperforms prior PTQ approaches (e.g., GPTQ, OWQ, PB-LLM) and demonstrates robustness across perplexity benchmarks (C4, WikiText-2) and diverse zero-shot tasks. Overall, APTQ offers a practical, retraining-free pathway to deploy large transformers on resource-constrained devices without substantial accuracy loss.
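
As a rough reading of these bit widths, assume every weight is stored at either 2 or 4 bits and ignore any storage overhead for scales or zero points (an assumption made here for illustration, not a statement from the paper). The average bit width $\bar{b}$ then follows from the fraction $r$ of weights kept at 4 bits:

```latex
\[
\bar{b} = 4r + 2(1 - r) = 2 + 2r,
\qquad
r = \frac{\bar{b} - 2}{2}.
\]
```

Under that assumption, $\bar{b} = 3.8$ corresponds to $r = 0.9$ (about 90% of weights at 4 bits), and $\bar{b} = 4$ corresponds to $r = 1$.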

Abstract

Large Language Models (LLMs) have greatly advanced the natural language processing paradigm. However, their high computational load and huge model sizes pose a grand challenge for deployment on edge devices. To this end, we propose APTQ (Attention-aware Post-Training Mixed-Precision Quantization) for LLMs, which considers not only the second-order information of each layer's weights, but also, for the first time, the nonlinear effect of attention outputs on the entire model. We leverage the Hessian trace as a sensitivity metric for mixed-precision quantization, ensuring an informed precision reduction that retains model performance. Experiments show that APTQ surpasses previous quantization methods, achieving a perplexity of 5.22 on the C4 dataset at an average bit width of 4, nearly equivalent to full precision. In addition, APTQ attains state-of-the-art zero-shot accuracy of 68.24% and 70.48% at an average bit width of 3.8 on LLaMA-7B and LLaMA-13B, respectively, demonstrating its effectiveness in producing high-quality quantized LLMs.
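
The idea of using a Hessian-trace score as a sensitivity metric for choosing between 2-bit and 4-bit precision can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the function names `allocate_bits` and `average_bits`, the layer names, the trace scores, and the parameter counts are all hypothetical, and the simple "rank and threshold" rule is an assumption about how such a metric might be applied.

```python
def allocate_bits(sensitivity, four_bit_ratio):
    """Give 4 bits to the most sensitive layers and 2 bits to the rest.

    sensitivity: dict of layer name -> Hessian-trace score (higher = more sensitive).
    four_bit_ratio: fraction of layers to keep at 4 bits, in [0, 1].
    """
    if not 0.0 <= four_bit_ratio <= 1.0:
        raise ValueError("four_bit_ratio must be in [0, 1]")
    # Rank layers from most to least sensitive.
    ranked = sorted(sensitivity, key=sensitivity.get, reverse=True)
    n_four = round(four_bit_ratio * len(ranked))
    return {name: (4 if i < n_four else 2) for i, name in enumerate(ranked)}


def average_bits(bits, sizes):
    """Parameter-weighted average bit width (ignores scale/zero-point overhead)."""
    total = sum(sizes[name] for name in bits)
    return sum(bits[name] * sizes[name] for name in bits) / total


# Hypothetical trace scores and equal parameter counts, for illustration only.
sensitivity = {"q_proj": 9.1, "k_proj": 7.4, "v_proj": 3.2, "o_proj": 2.5, "mlp_up": 1.1}
sizes = {name: 1000 for name in sensitivity}

bits = allocate_bits(sensitivity, four_bit_ratio=0.6)
avg = average_bits(bits, sizes)
print(bits, avg)
```

With these made-up numbers, the three highest-scoring layers stay at 4 bits and the remaining two drop to 2 bits. In practice the scores would come from an estimate of each layer's Hessian trace rather than from hand-written values.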

Paper Structure

This paper contains 13 sections, 18 equations, 2 figures, 3 tables, 1 algorithm.

Figures (2)

  • Figure 1: Overall architecture of APTQ (Attention-aware Post-Training Mixed-Precision Quantization): Unifying comprehensive transformer attention analysis with layer-specific Hessian trace quantization for enhanced model understanding.
  • Figure 2: Comparative perplexity results of LLaMA-7B using APTQ at various 4-bit ratios against other methods on the C4 dataset.