SKIM: Any-bit Quantization Pushing The Limits of Post-Training Quantization

Runsheng Bai, Bo Liu, Qiang Liu

TL;DR

SKIM introduces an adaptive, any-bit post-training quantization method for large language models by combining channel-wise greedy bit allocation with a trainable scaling vector. It unifies layer-wise and sensitivity-based quantization objectives under a shared framework and employs a mixed-precision strategy that can operate at non-integer bit levels. Empirically, SKIM reduces perplexity significantly at low bit-widths (notably around 3-bit) and improves MMLU performance while lowering memory footprint, outperforming prior PTQ methods like SqueezeLLM and OmniQuant. The approach broadens deployment feasibility for LLMs by enabling flexible memory-budget trade-offs and reducing manual tuning, with efficient training and packing procedures.

Abstract

Large Language Models (LLMs) exhibit impressive performance across various tasks, but deploying them for inference poses challenges. Their high resource demands often necessitate complex, costly multi-GPU pipelines, or the use of smaller, less capable models. While quantization offers a promising solution utilizing lower precision for model storage, existing methods frequently experience significant performance drops at lower precision levels. Additionally, they typically provide only a limited set of solutions at specific bit levels, many of which are extensively manually tuned. To address these challenges, we propose a new method called SKIM: Scaled K-means clustering wIth Mixed precision. Our approach introduces two novel techniques: 1. A greedy algorithm to solve approximately optimal bit allocation across weight channels, and 2. A trainable scaling vector for non-differentiable K-means clustering. These techniques substantially improve performance and can be adapted to any given bit. Notably, in terms of model perplexity, our method narrows the gap between 3-bit quantized LLaMA models and their full precision counterparts by 16.3% on average.
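The abstract names a greedy algorithm for bit allocation across weight channels but does not describe it here. As a rough illustration only, and not the authors' algorithm, the sketch below shows one generic way such a greedy scheme could work. Every channel (row) starts at a minimum bit level, and one extra bit is repeatedly given to the channel whose K-means quantization error drops the most, until a total bit budget is used up. The function names `kmeans_1d_error` and `greedy_bit_allocation`, the unweighted squared-error objective, the quantile initialisation, and the `min_bits`/`max_bits` bounds are all assumptions made for this example.

```python
import numpy as np

def kmeans_1d_error(values, k, iters=25):
    """Squared error after clustering 1-D values into k centroids (plain Lloyd's algorithm)."""
    # Assumption: quantile initialisation, chosen only to keep the sketch deterministic.
    centroids = np.quantile(values, (np.arange(k) + 0.5) / k)
    for _ in range(iters):
        assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
        for j in range(k):
            members = values[assign == j]
            if members.size:  # leave empty clusters where they are
                centroids[j] = members.mean()
    assign = np.abs(values[:, None] - centroids[None, :]).argmin(axis=1)
    return float(((values - centroids[assign]) ** 2).sum())

def greedy_bit_allocation(weights, avg_bits, min_bits=1, max_bits=4):
    """Assign an integer bit level to each row so that the mean is (about) avg_bits."""
    n_channels = weights.shape[0]
    levels = list(range(min_bits, max_bits + 1))
    # err[c, i] = quantization error of channel c when it uses levels[i] bits.
    err = np.array([[kmeans_1d_error(row, 2 ** b) for b in levels] for row in weights])
    bits = np.full(n_channels, min_bits)
    # Number of single-bit upgrades available on top of the minimum allocation.
    budget = int(round(avg_bits * n_channels)) - min_bits * n_channels
    for _ in range(budget):
        upgradable = np.where(bits < max_bits)[0]
        if upgradable.size == 0:
            break
        idx = bits[upgradable] - min_bits
        # Error reduction obtained by giving each upgradable channel one more bit.
        gain = err[upgradable, idx] - err[upgradable, idx + 1]
        bits[upgradable[int(gain.argmax())]] += 1
    return bits

# Hypothetical usage: 8 channels with different scales, 2.5 bits per weight on average.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 64)) * np.linspace(0.1, 2.0, 8)[:, None]
bits = greedy_bit_allocation(W, avg_bits=2.5)
print(bits, bits.mean())
```

Because each channel keeps an integer bit level while the budget is set on the total, the average across channels can land on a non-integer value such as 2.5, which is one way to read the "any-bit" behaviour described above.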

Paper Structure

This paper contains 37 sections, 9 equations, 6 figures, 7 tables, 3 algorithms.

Figures (6)

  • Figure 1: Our SKIM method adaptively quantizes the model to any specified bit and achieves superior performance. The perplexity reported is of LLaMA-7B on the WikiText2 dataset.
  • Figure 2: Overall procedure of our proposed SKIM algorithm. The method consists of three main parts: a greedy algorithm for bit allocation, weighted K-means clustering based on that allocation, and the trainable scaling vector. More details are available in Section \ref{sec:method}.
  • Figure 3: Histogram of the channel-wise quantization error for $self\_attn.q\_proj$ in the second layer of LLaMA-7B. Errors vary significantly and exhibit a long-tailed distribution toward larger values.
  • Figure 4: Error variation of $self\_attn.q\_proj$ in the first layer of LLaMA-7B. For clearer visualization, we randomly sampled 10% of the rows, with each point representing one row. The horizontal axis shows the quantization error at 2 bits, and the vertical axis shows the error after increasing the bit level from 2 to 3. Note that rows with similar errors at 2 bits do not necessarily have similar errors at 3 bits, and a larger error at 2 bits does not necessarily lead to a larger error at 3 bits.
  • Figure 5: Perplexity variation after enabling the scaling vector. Perplexity consistently decreases when additional optimization of the scaling vector is applied.
  • ...and 1 more figure