SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size

Junhao Xia, Ming Zhao, Limin Xiao, Xiujun Zhang

TL;DR

SDQ-LLM tackles the memory and computation barriers of deploying large language models by enabling extremely low-bit quantization with controllable precision and size. It repurposes Sigma-Delta quantization with upsampling to encode high-precision weights into 1-bit or ~1.58-bit representations, replacing multiplications with additions in linear layers. Key innovations include Hadamard-based weight smoothing and MultiOSR, a layer- and linear-wise OSR allocation strategy that ties quantization sensitivity to weight variance. Extensive experiments on OPT and LLaMA families demonstrate superior perplexity and competitive zero-shot performance under aggressive OSR settings, along with faster quantization times, underscoring the method's practicality for memory-constrained deployments. The work also provides a public implementation to facilitate adoption and further research.

Abstract

Large language models (LLMs) face significant computational and memory challenges, making extremely low-bit quantization crucial for their efficient deployment. In this work, we introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size, a novel framework that enables extremely low-bit quantization of LLMs while preserving their linguistic reasoning capabilities. A distinctive feature of SDQ-LLM is the continuous adjustability of the Over-Sampling Ratio (OSR), which allows the model to adapt to memory or VRAM constraints by selecting a fractional OSR (e.g., 2.5 times) that trades off model size against accuracy. SDQ-LLM combines upsampling with a Sigma-Delta quantizer to binarize or ternarize LLM weights, encoding high-precision parameters into 1-bit or 1.58-bit representations and replacing the multiplication operations within linear layers with additions. This approach significantly enhances inference efficiency under extremely low-bit quantization. To further reduce quantization precision loss, we apply Hadamard-based weight smoothing prior to quantization, improving the stability and robustness of the weight representations. Furthermore, to fully exploit the continuity of the OSR, and recognizing the correlation between quantization sensitivity and weight variance, we propose MultiOSR, a fine-grained, layer- and linear-wise OSR allocation strategy. This strategy distributes OSR both across layers and within each layer, based on weight variance and parameter scale. Finally, extensive experiments on the OPT and LLaMA model families demonstrate that SDQ-LLM achieves more efficient and more accurate performance even under highly aggressive low-OSR settings. Our code is available at https://github.com/Dreamlittlecat/LLM-Quant-Factory.
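
As a rough illustration of the upsample-then-quantize idea described in the abstract, the following is a minimal sketch of a first-order Sigma-Delta modulator in plain Python. It is not the authors' implementation: the function names, the zero-order-hold upsampling, the restriction to integer OSR, and the block-average reconstruction are all assumptions made for illustration, and the paper's handling of fractional OSR is not reproduced here.

```python
def upsample_hold(weights, osr):
    """Repeat each weight `osr` times (zero-order hold).

    Assumption: integer OSR only; the paper also supports fractional OSR.
    """
    return [w for w in weights for _ in range(osr)]


def sigma_delta_1bit(signal, scale=1.0):
    """First-order Sigma-Delta modulator in error-feedback form.

    Each output is +1 or -1 (standing for +scale or -scale). The
    quantization error of one step is carried into the next step, so the
    error is pushed toward high frequencies (noise shaping).
    """
    bits = []
    state = 0.0  # accumulated (negated) quantization error
    for x in signal:
        v = state + x                      # input plus carried-over error
        q = scale if v >= 0 else -scale    # 1-bit quantizer
        bits.append(1 if q > 0 else -1)
        state = v - q                      # error fed back to the next step
    return bits


def reconstruct(bits, osr, scale=1.0):
    """Estimate each original weight as the mean of its `osr` output bits."""
    return [
        scale * sum(bits[i:i + osr]) / osr
        for i in range(0, len(bits), osr)
    ]


if __name__ == "__main__":
    weights = [0.5, -0.25, 0.9, -1.0, 0.0]
    osr = 8
    bits = sigma_delta_1bit(upsample_hold(weights, osr))
    for w, r in zip(weights, reconstruct(bits, osr)):
        print(f"weight={w:+.3f}  reconstructed={r:+.3f}")
```

In this sketch, as long as every weight lies in `[-scale, scale]`, the carried error stays bounded by `scale`, so each block average differs from its weight by at most `2 * scale / osr`. This is one way to see why a larger OSR buys accuracy at the cost of storing more bits per weight.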

Paper Structure

This paper contains 13 sections, 9 equations, 6 figures, 4 tables, 1 algorithm.

Figures (6)

  • Figure 1: Schematic diagram of the SDQ-LLM processing pipeline. The green dashed-line box represents the original Sigma-Delta Quantizer. $Z^{-1}$ stands for a delay element and $\frac{1}{1 - Z^{-1}}$ represents an integrator.
  • Figure 2: Oversample and noise shaping. The left panel presents the spectrograms of the original signal before and after quantization, whereas the right panel presents those of the upsampled signal before and after quantization.
  • Figure 3: The time domain and frequency domain distribution of the opt-1.3b.layer.3.q_proj weight matrix before (blue) and after (orange) being multiplied by the Hadamard matrix.
  • Figure 4: The figure illustrates the MultiOSR allocation strategy: First, the average OSR for each decoder layer is computed based on the overall average OSR and the parameter variance of the respective layer. Then, within each decoder layer, the OSR for the linear layers (q, k, v, o, etc.) is assigned based on the layer’s average OSR, parameter variance, and weight proportion.
  • Figure 5: Flow chart of linear input during the inference process
  • ...and 1 more figure
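
The Hadamard-based smoothing mentioned in the Figure 3 caption can be illustrated with a small, hedged sketch. It assumes a normalized Sylvester-type Hadamard matrix and a power-of-two dimension; the helper names (`hadamard`, `matmul`, `matvec`) are hypothetical, and the captions above do not state exactly how the paper applies the transform at inference time. The sketch only shows the general property that multiplying by an orthogonal Hadamard matrix spreads an outlier across a row, and that the layer output can be preserved by applying the same matrix to the input.

```python
import math


def hadamard(n):
    """Normalized n x n Hadamard matrix (Sylvester construction).

    n must be a power of two. The result H is symmetric and orthogonal,
    so H multiplied by H is the identity.
    """
    if n < 1 or n & (n - 1):
        raise ValueError("n must be a power of two")
    h = [[1.0]]
    while len(h) < n:
        # Block form [[H, H], [H, -H]] doubles the size each iteration.
        h = [row + row for row in h] + [row + [-v for v in row] for row in h]
    s = 1.0 / math.sqrt(n)
    return [[v * s for v in row] for row in h]


def matvec(m, x):
    """Matrix-vector product for lists of lists."""
    return [sum(a * b for a, b in zip(row, x)) for row in m]


def matmul(a, b):
    """Matrix-matrix product for lists of lists."""
    cols = list(zip(*b))
    return [[sum(p * q for p, q in zip(row, col)) for col in cols] for row in a]


if __name__ == "__main__":
    w = [[8.0, 0.1, -0.1, 0.0]]   # one row with a large outlier
    x = [1.0, 2.0, 3.0, 4.0]      # example layer input
    h = hadamard(4)

    w_smooth = matmul(w, h)       # smoothed weights: W H
    x_rot = matvec(h, x)          # rotated input: H x

    print("max |w| before:", max(abs(v) for v in w[0]))
    print("max |w| after: ", max(abs(v) for v in w_smooth[0]))
    print("W x:           ", matvec(w, x)[0])
    print("(W H)(H x):    ", matvec(w_smooth, x_rot)[0])
```

Because the smoothed row has a smaller dynamic range than the original, a low-bit quantizer with a fixed output range wastes less of that range on a single outlier.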