Interactions Across Blocks in Post-Training Quantization of Large Language Models

Khasmamad Shabanovi, Lukas Wiest, Vladimir Golkov, Daniel Cremers, Thomas Pfeil

TL;DR

The effectiveness of multi-block fine-tuning strategies for post-training quantization depends on the specific network model: some models show no improvement, while others benefit significantly.

Abstract

Post-training quantization is widely employed to reduce the computational demands of neural networks. Typically, individual substructures, such as layers or blocks of layers, are quantized with the objective of minimizing quantization errors in their pre-activations by fine-tuning the corresponding weights. Deriving this local objective from the global objective of minimizing task loss involves two key simplifications: assuming substructures are mutually independent and ignoring the knowledge of subsequent substructures as well as the task loss. In this work, we assess the effects of these simplifications on weight-only quantization of large language models. We introduce two multi-block fine-tuning strategies and compare them against the baseline of fine-tuning single transformer blocks. The first captures correlations of weights across blocks by jointly optimizing multiple quantized blocks. The second incorporates knowledge of subsequent blocks by minimizing the error in downstream pre-activations rather than focusing solely on the quantized block. Our findings indicate that the effectiveness of these methods depends on the specific network model, with no impact on some models but demonstrating significant benefits for others.
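The difference between the three objectives described in the abstract can be made concrete with a small sketch. The code below is an illustration under stated assumptions, not the authors' implementation: the toy residual `block`, the round-to-nearest `quantize`, and the helper names `make_variant` and `reconstruction_loss` are all hypothetical, both paths are fed the same inputs, and no fine-tuning is performed. It only shows where the reconstruction loss is attached and which blocks are quantized in each setting.

```python
import math
import random

def quantize(weights, bits=4):
    """Symmetric uniform round-to-nearest quantization (toy stand-in for weight-only PTQ)."""
    max_abs = max(abs(w) for row in weights for w in row)
    levels = 2 ** (bits - 1) - 1
    scale = max_abs / levels if max_abs > 0 else 1.0
    return [[round(w / scale) * scale for w in row] for row in weights]

def block(weights, x):
    """Toy block (assumption): residual connection around a tanh-activated linear map."""
    return [xi + math.tanh(sum(w * xj for w, xj in zip(row, x)))
            for xi, row in zip(x, weights)]

def run(blocks, x, start, stop):
    """Propagate x through blocks[start:stop]."""
    for weights in blocks[start:stop]:
        x = block(weights, x)
    return x

def make_variant(fp_blocks, quantized_indices, bits=4):
    """Copy of the network in which only the listed blocks are quantized."""
    return [quantize(w, bits) if k in quantized_indices else w
            for k, w in enumerate(fp_blocks)]

def reconstruction_loss(fp_blocks, q_blocks, inputs, i, n):
    """Mean squared error between both networks, measured at the output of block i+n-1.

    n = 1 attaches the loss to block i itself; n > 1 attaches it to a downstream block.
    """
    total = 0.0
    for x in inputs:
        target = run(fp_blocks, x, i, i + n)
        pred = run(q_blocks, x, i, i + n)
        total += sum((t - p) ** 2 for t, p in zip(target, pred)) / len(target)
    return total / len(inputs)

random.seed(0)
d, num_blocks = 4, 4
fp_blocks = [[[random.uniform(-1, 1) for _ in range(d)] for _ in range(d)]
             for _ in range(num_blocks)]
inputs = [[random.gauss(0, 1) for _ in range(d)] for _ in range(8)]

i, n = 1, 2  # first block to quantize, number of blocks spanned by the loss

# Single-block: quantize block i, measure the error at its own output.
sb_loss = reconstruction_loss(fp_blocks, make_variant(fp_blocks, {i}), inputs, i, 1)
# Look-ahead: quantize block i only, measure the error n-1 blocks downstream.
la_loss = reconstruction_loss(fp_blocks, make_variant(fp_blocks, {i}), inputs, i, n)
# Multi-block: quantize blocks i..i+n-1 jointly, measure the error after the last one.
mb_loss = reconstruction_loss(fp_blocks, make_variant(fp_blocks, set(range(i, i + n))),
                              inputs, i, n)

print(f"SB loss: {sb_loss:.6f}  LA loss: {la_loss:.6f}  MB loss: {mb_loss:.6f}")
```

In an actual fine-tuning loop these losses would be minimized with respect to the quantized weights of the optimized block or blocks; the sketch stops at evaluating them so that the three settings can be compared side by side.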

Paper Structure

This paper contains 14 sections, 9 equations, 3 figures, 6 tables.

Figures (3)

  • Figure 1: In standard SB-PTQ (a), each block is optimized independently with the loss attached to its own output. We propose LA-PTQ (b) and MB-PTQ (c), where the reconstruction loss is attached to a subsequent block. For LA-PTQ, a single block is still optimized (red) while all other blocks are left unmodified (blue). The blocks that contribute to the computation of the gradient are highlighted in green. For MB-PTQ, multiple blocks are jointly optimized. All previous blocks are already quantized and fine-tuned (indicated with zig-zag lines).
  • Figure 2: Comparison of task accuracy for look-ahead (LA-) and multi-block (MB-) PTQ against single-block (SB-) PTQ, across varying numbers of blocks $n$ and different network models. For all models, we present the average and standard error over 4 trials.
  • Figure 3: Dependence of task accuracy on the size of the calibration dataset (left) and the number of fine-tuning iterations (right), shown for LlaMa-2-7B. Default values are highlighted in bold.