
Optimize Weight Rounding via Signed Gradient Descent for the Quantization of LLMs

Wenhua Cheng, Weiwei Zhang, Haihao Shen, Yiyang Cai, Xin He, Kaokao Lv, Yi Liu

TL;DR

The paper addresses the memory and storage cost of deploying large language models by proposing SignRound, a weight-only quantization method that uses signed gradient descent (SignSGD) to tune rounding offsets and weight-clipping ranges. It combines ideas from QAT and PTQ in a lightweight 200-step tuning process. SignRound adds trainable parameters for rounding and clipping and uses block-wise output reconstruction to minimize a Frobenius-norm objective, without adding inference overhead. Across 7B–70B models it achieves strong results at 2 to 4 bits, reaches near-lossless accuracy at 4 bits with model-specific hyperparameter tuning, and generalizes to newer models. The authors release public code and report better speed and accuracy than state-of-the-art rounding methods and weight-only quantization baselines.

Abstract

Large Language Models (LLMs) have demonstrated exceptional proficiency in language-related tasks, but their deployment poses significant challenges due to substantial memory and storage requirements. Weight-only quantization has emerged as a promising solution, significantly reducing memory and storage needs without sacrificing too much performance. In this study, we introduce SignRound, a method that leverages signed gradient descent (SignSGD) to optimize rounding values and weight clipping in just 200 steps. SignRound integrates the advantages of Quantization-Aware Training (QAT) and Post-Training Quantization (PTQ), delivering exceptional results across 2 to 4 bits while minimizing tuning costs and avoiding additional inference overhead. For example, SignRound achieved absolute average accuracy improvements ranging from 6.91% to 33.22% at 2 bits, as measured by the average zero-shot accuracy across 11 tasks. It also demonstrates strong generalization in recent models, achieving near-lossless 4-bit quantization in most scenarios. The source code is publicly available at https://github.com/intel/auto-round.
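
As a rough illustration of the idea in the abstract, the following NumPy sketch tunes a per-weight rounding offset with signed gradient descent on a single linear layer. It is not the auto-round implementation: the function names, the per-row asymmetric scheme, the [-0.5, 0.5] offset bound, the straight-through gradient, the 1/steps learning rate with linear decay, and the random calibration data are all assumptions made for the example, and the clipping parameters are kept fixed rather than trained.

```python
import numpy as np

def quant_dequant(W, V, scale, zp, qmax):
    """Quantize W with rounding offset V, then dequantize (hypothetical helper)."""
    q = np.round(W / scale + zp + V)
    in_range = (q >= 0) & (q <= qmax)      # entries not hit by clipping
    q = np.clip(q, 0, qmax)
    return scale * (q - zp), in_range

def signround_layer(W, X, bits=4, steps=200):
    """Tune rounding offsets V for one layer; returns (best_V, best_loss, rtn_loss)."""
    qmax = 2 ** bits - 1
    w_min = W.min(axis=1, keepdims=True)
    w_max = W.max(axis=1, keepdims=True)
    scale = np.maximum(w_max - w_min, 1e-8) / qmax   # per-row scale (assumed)
    zp = np.round(-w_min / scale)                    # per-row zero point
    lr = 1.0 / steps                                 # assumed base learning rate
    V = np.zeros_like(W)                             # V = 0 is plain RTN
    Y_ref = X @ W.T                                  # full-precision layer output
    best_loss, best_V, rtn_loss = np.inf, V.copy(), None
    for step in range(steps):
        W_q, in_range = quant_dequant(W, V, scale, zp, qmax)
        err = X @ W_q.T - Y_ref                      # shape (n_samples, out)
        loss = float(np.sum(err ** 2))               # squared Frobenius norm
        if step == 0:
            rtn_loss = loss
        if loss < best_loss:
            best_loss, best_V = loss, V.copy()
        grad_Wq = 2.0 * err.T @ X                    # dL/dW_q, shape (out, in)
        # Straight-through estimator: treat round() as identity when not clipped.
        grad_V = grad_Wq * scale * in_range
        step_lr = lr * (1.0 - step / steps)          # linear decay (assumed)
        # Signed update: only the sign of the gradient is used.
        V = np.clip(V - step_lr * np.sign(grad_V), -0.5, 0.5)
    return best_V, best_loss, rtn_loss

rng = np.random.default_rng(0)
W = rng.normal(size=(8, 16))     # toy weights, not from any real model
X = rng.normal(size=(64, 16))    # toy calibration inputs
V, tuned, rtn = signround_layer(W, X, bits=2)
print(f"RTN loss {rtn:.3f} -> tuned loss {tuned:.3f}")
```

Because the offsets start at zero, step 0 reproduces round-to-nearest, so the best loss kept by the loop can never be worse than the RTN loss on the calibration data.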

Paper Structure

This paper contains 28 sections, 6 equations, 8 figures, 14 tables, and 1 algorithm.

Figures (8)

  • Figure 1: An illustration of SignRound. Unlike the direct rounding in RTN, SignRound performs signed gradient descent to fine-tune the rounding and weight clipping through block-wise output reconstruction. After lightweight forward and backward steps, $\textbf{W}_{\text{INT4}}$ has been well optimized. Note that Quant and Dequant are two standard operations for quantization and dequantization respectively.
  • Figure : Mistral-7B, alpha values
  • Figure : Mistral-7B, alpha values
  • Figure : Llama-2-7B, alpha values
  • Figure : Mistral-7B, beta values
  • ...and 3 more figures
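
The Quant and Dequant operations named in the Figure 1 caption, together with the round-to-nearest (RTN) baseline, can be sketched as follows. This is a minimal per-tensor asymmetric version; the function names and the per-tensor granularity are assumptions for illustration, not the scheme used in the paper.

```python
import numpy as np

def quant(W, bits=4):
    """RTN quantization to unsigned integers; returns (q, scale, zero_point)."""
    qmax = 2 ** bits - 1
    w_min, w_max = float(W.min()), float(W.max())
    scale = max(w_max - w_min, 1e-8) / qmax
    zp = int(np.round(-w_min / scale))
    # Direct rounding: each weight goes to its nearest grid point.
    q = np.clip(np.round(W / scale) + zp, 0, qmax).astype(np.uint8)
    return q, scale, zp

def dequant(q, scale, zp):
    """Map integer codes back to floating-point weights."""
    return scale * (q.astype(np.float64) - zp)

rng = np.random.default_rng(0)
W = rng.normal(size=(4, 6))          # toy weights for illustration
q, scale, zp = quant(W, bits=4)
W_hat = dequant(q, scale, zp)
print("max abs error:", np.abs(W - W_hat).max(), "grid step:", scale)
```

RTN fixes each rounding decision independently of the layer output, which is the part SignRound replaces with a tuned decision.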