SliM-LLM: Salience-Driven Mixed-Precision Quantization for Large Language Models

Wei Huang, Haotong Qin, Yangdong Liu, Yawei Li, Qinshuo Liu, Xianglong Liu, Luca Benini, Michele Magno, Shiming Zhang, Xiaojuan Qi

TL;DR

The authors address the challenge of aggressive post-training quantization for large language models by proposing SliM-LLM, a salience-driven, group-wise mixed-precision framework. The framework introduces Salience-Determined Bit Allocation (SBA), which assigns per-group bit-widths based on global weight salience, and Salience-Weighted Quantizer Calibration (SQC), which preserves locally salient information within groups. Together these enable structured, hardware-friendly quantization. Integrated with the GPTQ and OmniQuant pipelines (yielding SliM-LLM and SliM-LLM+, respectively), the method achieves substantial memory reduction and perplexity improvements across the LLaMA and OPT families at 2-bit and 3-bit levels, while maintaining competitive on-device inference efficiency. The approach is supported by ablation studies and deployment results, and open-source code is provided for experimentation.

Abstract

Post-training quantization (PTQ) is an effective technique for compressing large language models (LLMs). However, while uniform-precision quantization is computationally efficient, it often compromises model performance. To address this, we propose SliM-LLM, a salience-driven mixed-precision quantization framework that allocates bit-widths at the group level. Our approach leverages the observation that important weights follow a structured distribution and introduces two key components: 1) Salience-Determined Bit Allocation, which adaptively assigns bit-widths to groups within each layer based on their salience; and 2) Salience-Weighted Quantizer Calibration, which optimizes quantizer parameters by incorporating element-level salience. With its structured partitioning, SliM-LLM provides a hardware-friendly solution that matches the efficiency of uniform quantization methods while improving accuracy. Experiments show that SliM-LLM achieves superior performance across various LLMs at low bit-widths. For example, a 2-bit quantized LLaMA-7B model reduces memory usage by nearly 6x compared to the floating-point baseline, decreases perplexity by 48% compared to state-of-the-art gradient-free PTQ methods, and maintains GPU inference speed. Additionally, the extended version, SliM-LLM+, which incorporates gradient-based quantization, further reduces perplexity by 35.1%. Our code is available at https://github.com/Aaronhuang-778/SliM-LLM
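
To make the idea of group-wise, salience-based bit allocation more concrete, the following is a minimal sketch and not the authors' implementation. It assumes a simple salience proxy (squared weight times the matching diagonal entry of the activation Hessian) and a fixed fraction of shifted groups; the abstract specifies neither, so the function name `allocate_group_bits` and the parameters `hessian_diag` and `frac_shifted` are hypothetical. An equal number of groups is moved one bit up and one bit down, so the average bit-width stays at the target.

```python
import numpy as np

def allocate_group_bits(weight, hessian_diag, group_size, target_bits, frac_shifted):
    """Assign one bit-width per column group of `weight`.

    Assumption (not from the source): element salience is w_ij^2 * H_jj,
    a simple activation-aware proxy. The paper's exact metric may differ.
    """
    out_features, in_features = weight.shape
    assert in_features % group_size == 0, "columns must split evenly into groups"
    n_groups = in_features // group_size

    # Element-wise salience, then the mean over each group of input columns.
    salience = (weight ** 2) * hessian_diag[None, :]
    group_salience = salience.reshape(out_features, n_groups, group_size).mean(axis=(0, 2))

    # Shift k low-salience groups down and k high-salience groups up by one bit.
    k = min(int(frac_shifted * n_groups), n_groups // 2)
    order = np.argsort(group_salience)  # ascending salience
    bits = np.full(n_groups, target_bits, dtype=int)
    if k > 0:
        bits[order[:k]] = target_bits - 1
        bits[order[-k:]] = target_bits + 1
    return bits

# Toy usage: 16 input channels in 4 groups; group 1 sees much larger activations.
rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))
h = np.ones(16)
h[4:8] = 100.0
bits = allocate_group_bits(W, h, group_size=4, target_bits=2, frac_shifted=0.25)
print(bits, bits.mean())
```

In this toy setup the group with the large Hessian entries receives 3 bits, one low-salience group receives 1 bit, and the average remains 2 bits. Because every weight in a group shares one bit-width, the layout stays structured, which is the property the abstract describes as hardware-friendly.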

Paper Structure (32 sections, 2 theorems, 8 equations, 11 figures, 15 tables)

This paper contains 32 sections, 2 theorems, 8 equations, 11 figures, 15 tables.

Key Result

Theorem 1

Given the input calibration activation $\boldsymbol{x}\in\mathbb{R}^{t\times m}$ with an outlier channel $\boldsymbol{x_{:,p}^*} \gg \boldsymbol{x}_{:,j}, \forall j\in[0,m], j \neq p$ at the position of channel-$p$. The trace elements of $\boldsymbol{H} = \boldsymbol{x}^\top\boldsymbol{x}$ will show
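
As a small numerical illustration of the setup in Theorem 1 (not of its conclusion), the sketch below builds a synthetic activation matrix with one outlier channel and inspects the diagonal of $\boldsymbol{H} = \boldsymbol{x}^\top\boldsymbol{x}$. By definition, the $j$-th diagonal entry equals $\sum_t x_{t,j}^2$, so a channel with much larger activations produces a much larger diagonal entry. The sizes `t` and `m`, the outlier index `p`, and the scale factor are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
t, m, p = 32, 6, 3            # tokens, channels, outlier channel (all assumed)
x = rng.normal(size=(t, m))
x[:, p] *= 50.0               # make channel p an outlier

H = x.T @ x                   # proxy Hessian built from calibration activations
diag = np.diag(H)

# Each diagonal entry is the sum of squared activations of one channel.
assert np.allclose(diag, (x ** 2).sum(axis=0))

outlier_channel = int(np.argmax(diag))
print(outlier_channel, diag.round(1))
```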

Figures (11)

  • Figure 1: (a) Perplexity ($\downarrow$) of existing low-bit PTQ methods on LLaMA at 2-bit. Solid lines indicate methods with structured quantization groups. (b) Comparison of PTQ methods with gradient-based quantizers at 3-bit. (c) Features of current low-bit quantization methods. C denotes codebook-based, S statistic-based, and G gradient-based quantizers.
  • Figure 2: Illustration of the proposed SliM-LLM. Salience-Determined Bit Allocation (SBA) assigns activation-aware, structured per-group precision, optimizing the global information distribution under quantization. Salience-Weighted Quantizer Calibration (SQC) detects discretely distributed salient weights, preserving locally important information in LLMs.
  • Figure 3: Distribution of weight salience in layer-2 and layer-10 of LLaMA-7B.
  • Figure 4: Local salience distribution of the $10^{th}$ MHA output layer.
  • Figure 5: Ablation results on OPT models. Random refers to randomly selecting lower- and higher-bit groups, while head-tail assigns lower-bit precision to the head groups and higher-bit precision to an equal number of tail groups in the original sequence.
  • ...and 6 more figures
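
The Figure 2 caption describes Salience-Weighted Quantizer Calibration as protecting locally salient weights inside a group. One simplified way to picture this, offered as a sketch under stated assumptions rather than the paper's procedure, is to choose a group's quantizer range by minimizing a salience-weighted reconstruction error instead of a plain one. The helper names `quantize_dequantize` and `calibrate_group`, the clipping-ratio grid, and the synthetic salience values are all hypothetical.

```python
import numpy as np

def quantize_dequantize(w, bits, clip):
    """Uniform asymmetric quantization of a 1-D group with a shrunken range."""
    qmax = 2 ** bits - 1
    center = 0.5 * (w.max() + w.min())
    half = 0.5 * (w.max() - w.min()) * clip   # clip < 1 narrows the range
    lo = center - half
    scale = 2.0 * half / qmax
    if scale == 0.0:                          # constant group: nothing to quantize
        return np.full_like(w, center)
    q = np.clip(np.round((w - lo) / scale), 0, qmax)
    return q * scale + lo

def calibrate_group(w, salience, bits, n_steps=51):
    """Grid-search the clipping ratio that minimizes salience-weighted error."""
    best_clip, best_err = 1.0, np.inf
    for clip in np.linspace(0.5, 1.0, n_steps):
        err = float(np.sum(salience * (w - quantize_dequantize(w, bits, clip)) ** 2))
        if err < best_err:
            best_clip, best_err = float(clip), err
    return best_clip, best_err

# Toy usage: a group of 128 weights where a few elements are far more salient.
rng = np.random.default_rng(0)
w = rng.normal(size=128)
s = np.ones(128)
s[:4] = 1e3                                   # assumed salient elements
clip, err = calibrate_group(w, s, bits=2)
print(clip, err)
```

Weighting the error by salience biases the chosen range toward values that reconstruct the salient elements well, at the cost of slightly larger error on the rest of the group. The bit-width and storage layout of the group are unchanged; only the quantizer parameters move.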

Theorems & Definitions (4)

  • Definition 3.1
  • Theorem 1
  • Theorem 1
  • proof