DAQ: Density-Aware Post-Training Weight-Only Quantization For LLMs

Yingsong Luo, Ling Chen

TL;DR

We propose density-aware post-training weight-only quantization (DAQ), which has two stages: 1) density-centric alignment, which identifies the center of high-density weights and centers the dynamic range on this point to align high-density weight regions with floating-point high-precision regions; 2) learnable dynamic range adjustment, which adjusts the dynamic range by optimizing quantization parameters based on the impact of weights on the model output.

Abstract

Large language models (LLMs) excel in various tasks but face deployment challenges due to hardware constraints. We propose density-aware post-training weight-only quantization (DAQ), which has two stages: 1) density-centric alignment, which identifies the center of high-density weights and centers the dynamic range on this point to align high-density weight regions with floating-point high-precision regions; 2) learnable dynamic range adjustment, which adjusts the dynamic range by optimizing quantization parameters (i.e., scale and zero-point) based on the impact of weights on the model output. Experiments on LLaMA and LLaMA-2 show that DAQ consistently outperforms the best baseline method, reducing perplexity loss by an average of 22.8% on LLaMA and 19.6% on LLaMA-2. Our code is available at https://github.com/LuoYingSong/DAQ.
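
The first stage can be illustrated with a minimal sketch. It is not the authors' implementation (see the linked repository for that): the quantile-midpoint estimator `density_center`, the helper `quantize_group`, the toy weight group, and the 4-bit grid `LEVELS` are all assumptions chosen for illustration. The sketch compares a dynamic range centered on zero with one centered on the estimated high-density point, using a grid whose levels are denser near its center.

```python
import numpy as np

# Illustrative 4-bit grid on [-1, 1] that is denser near zero
# (FP4 E2M1 magnitudes divided by 6). An assumption, not necessarily
# the format used in the paper.
_MAG = np.array([0.0, 0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 6.0]) / 6.0
LEVELS = np.unique(np.concatenate([-_MAG, _MAG]))


def density_center(w, lo_q=0.1, hi_q=0.9):
    """Hypothetical estimator: midpoint of a central quantile band."""
    lo, hi = np.quantile(w, [lo_q, hi_q])
    return 0.5 * (lo + hi)


def quantize_group(w, zero_point):
    """Quantize-dequantize one group with the range centered on zero_point."""
    # Smallest symmetric range around zero_point that covers every weight.
    scale = np.max(np.abs(w - zero_point))
    if scale == 0:
        scale = 1.0
    x = (w - zero_point) / scale
    # Snap each normalized weight to its nearest grid level.
    idx = np.argmin(np.abs(x[:, None] - LEVELS[None, :]), axis=1)
    return scale * LEVELS[idx] + zero_point


# Toy group of 128 weights: a dense cluster near 0.3 plus two outliers.
w = np.concatenate([np.linspace(0.29, 0.31, 126), [-0.5, 1.1]])
center = density_center(w)

mse_zero = np.mean((w - quantize_group(w, 0.0)) ** 2)
mse_centered = np.mean((w - quantize_group(w, center)) ** 2)
print(f"center={center:.3f}  mse(zero)={mse_zero:.2e}  mse(centered)={mse_centered:.2e}")
```

On this toy group, centering the range on the dense cluster places the cluster where the grid levels are closest together, so the reconstruction error is lower than with a range centered on zero. The second stage, which tunes scale and zero-point against the model output, is not covered by this sketch.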

Paper Structure

This paper contains 24 sections, 14 equations, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Using two dynamic ranges under two weight distributions. The yellow points represent the original weights. Under specific weight distributions, DAQ can expand and shift the dynamic range to align high-density weight regions with FP high-precision regions.
  • Figure 2: The distribution of 15 randomly selected weight groups (group size: 128) from LLaMA-2-7B. LLaMA-2-7B is a state-of-the-art LLM with 7 billion parameters, known for its strong performance across various natural language processing tasks.
  • Figure 3: The effect of adjusting zero-point and scale on the dynamic range. The yellow line represents the dynamic range, while the blue line represents the quantization range.
  • Figure 4: Perplexity under limited calibration datasets.
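
As a reading aid for Figure 3, the role of the two quantization parameters can be written out under a common affine convention. This is a sketch under assumptions: the symbols $s > 0$ (scale), $z$ (zero-point), $Q$ (rounding to the nearest level of the quantization grid), and the grid bounds $q_{\min}$, $q_{\max}$ are introduced here for illustration and may not match the paper's notation.

```latex
\hat{w} = s \cdot Q\!\left(\frac{w - z}{s}\right) + z,
\qquad
\hat{w} \in \left[\, z + s\,q_{\min},\; z + s\,q_{\max} \,\right]
```

Under this convention, changing $z$ shifts the dynamic range without changing its width, while changing $s$ widens or narrows it.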