Mixed-Precision Graph Neural Quantization for Low Bit Large Language Models

Wanlong Liu, Yichen Xiao, Dingyi Zeng, Hongyang Zhao, Wenyu Chen, Malu Zhang

TL;DR

This work tackles the challenge of deploying LLMs under resource constraints by improving post-training quantization at ultra-low bit-widths. It introduces MG-PTQ, a graph neural PTQ framework that uses a GNN to capture dependencies among weight columns via a Cholesky-based second-order Hessian, enabling adaptive bit-width allocation under a controllable average bit-width. The method combines a graph perceptual module, a bit-width allocator, and blockwise quantization, with training guided by a quantization-error objective and an average-bit-width constraint using an approximate gradient. Experiments on WikiText2 and C4 demonstrate that MG-PTQ outperforms GPTQ at 2 bits and offers robust efficiency, setting new benchmarks for low-bit LLM quantization.

Abstract

Post-Training Quantization (PTQ) is pivotal for deploying large language models (LLMs) in resource-limited settings because it significantly reduces resource demands. However, existing PTQ strategies underperform at low bit-widths (< 3 bits) due to the significant difference between the quantized and original weights. To enhance quantization performance at low bit-widths, we introduce a Mixed-precision Graph Neural PTQ (MG-PTQ) approach, employing a graph neural network (GNN) module to capture dependencies among weights and adaptively assign quantization bit-widths. Through the information propagation of the GNN module, our method more effectively captures dependencies among target weights, leading to a more accurate assessment of weight importance and optimized allocation of quantization strategies. Extensive experiments on the WikiText2 and C4 datasets demonstrate that our MG-PTQ method outperforms the previous state-of-the-art PTQ method, GPTQ, setting new benchmarks for quantization performance under low-bit conditions.
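
To make the idea of adaptive bit-width allocation under a fixed average bit-width concrete, the following is a minimal sketch and not the MG-PTQ implementation. It assumes a per-column importance score is already available (in the paper this role is played by the GNN module; here the squared column norm is used as a hypothetical stand-in), and it swaps in a simple greedy allocator and plain min-max uniform quantization. All function names and parameters are illustrative.

```python
from typing import List


def column_importance(col: List[float]) -> float:
    """Hypothetical stand-in for a learned importance score: squared L2 norm."""
    return sum(w * w for w in col)


def allocate_bits(importance: List[float], target_avg: float,
                  min_bits: int = 1, max_bits: int = 4) -> List[int]:
    """Greedy allocation: every column starts at min_bits, then the spare
    budget is handed to the most important columns first. The budget is
    rounded down, so the average never exceeds target_avg."""
    n = len(importance)
    bits = [min_bits] * n
    budget = int(target_avg * n) - min_bits * n
    order = sorted(range(n), key=lambda j: importance[j], reverse=True)
    for j in order:
        if budget <= 0:
            break
        extra = min(max_bits - bits[j], budget)
        bits[j] += extra
        budget -= extra
    return bits


def quantize_column(col: List[float], bits: int) -> List[float]:
    """Min-max uniform quantization of one column onto 2**bits levels."""
    lo, hi = min(col), max(col)
    levels = (1 << bits) - 1
    if hi == lo:
        return [lo] * len(col)
    scale = (hi - lo) / levels
    return [lo + round((w - lo) / scale) * scale for w in col]


def squared_error(a: List[float], b: List[float]) -> float:
    """Sum of squared differences between original and quantized weights."""
    return sum((x - y) ** 2 for x, y in zip(a, b))


if __name__ == "__main__":
    # Toy weight matrix stored column by column (values are made up).
    columns = [
        [0.9, -1.2, 0.4, 2.0],
        [0.1, 0.0, -0.1, 0.05],
        [0.5, -0.5, 0.3, -0.2],
        [0.01, 0.02, -0.01, 0.0],
    ]
    scores = [column_importance(c) for c in columns]
    bits = allocate_bits(scores, target_avg=2.0)
    total_err = sum(
        squared_error(c, quantize_column(c, b)) for c, b in zip(columns, bits)
    )
    print("bits per column:", bits)
    print("average bit-width:", sum(bits) / len(bits))
    print("total squared error:", total_err)
```

The greedy allocator is only one way to meet an average-bit-width budget; the abstract describes a learned allocation instead, so this sketch illustrates the constraint rather than the method.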

Paper Structure

This paper contains 23 sections, 7 equations, 2 figures, 1 table, 1 algorithm.

Figures (2)

  • Figure 1: The overall architecture of the MG-PTQ model.
  • Figure 2: Further experimental analysis. Sub-figure (a) presents the ablation study of the LLaMA-7b model on the C4 dataset across different quantization bit depths. Sub-figure (b) shows the efficiency analysis, in which the quantization time of the LLaMA-7b model is measured across different quantization strategies.