SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size

Junhao Xia; Ming Zhao; Limin Xiao; Xiujun Zhang

TL;DR

SDQ-LLM tackles the memory and computation barriers of deploying large language models by enabling extremely low-bit quantization with controllable precision and size. It repurposes Sigma-Delta quantization with upsampling to encode high-precision weights into 1-bit or ~1.58-bit representations, replacing multiplications with additions in linear layers. Key innovations include Hadamard-based weight smoothing and MultiOSR, a layer- and linear-wise OSR allocation strategy that ties quantization sensitivity to weight variance. Extensive experiments on OPT and LLaMA families demonstrate superior perplexity and competitive zero-shot performance under aggressive OSR settings, along with faster quantization times, underscoring the method's practicality for memory-constrained deployments. The work also provides a public implementation to facilitate adoption and further research.
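The Hadamard-based weight smoothing mentioned above can be illustrated with a small NumPy sketch. It is not the authors' implementation. It assumes the common approach of right-multiplying a weight matrix by a normalized Hadamard matrix, and the names `hadamard`, `W`, and `W_smooth` are hypothetical. The point is only that an orthogonal rotation spreads an outlier weight across the row and can be undone exactly.

```python
import numpy as np

def hadamard(n):
    """Normalized Hadamard matrix via the Sylvester construction (n must be a power of two)."""
    assert n > 0 and n & (n - 1) == 0, "n must be a power of two"
    H = np.array([[1.0]])
    while H.shape[0] < n:
        H = np.block([[H, H], [H, -H]])
    return H / np.sqrt(n)  # scaling makes H orthogonal: H @ H.T == I

rng = np.random.default_rng(0)
W = rng.normal(scale=0.01, size=(4, 8))  # toy weight matrix (assumed shape)
W[0, 3] = 1.0                            # one large outlier weight

H = hadamard(W.shape[1])
W_smooth = W @ H          # rotate: the outlier's energy is spread over the row
W_back = W_smooth @ H.T   # the rotation is exactly invertible

print("max |W|        =", np.abs(W).max())
print("max |W_smooth| =", np.abs(W_smooth).max())
```

Because `H` is orthogonal, `W @ x` equals `(W @ H) @ (H.T @ x)`, so a layer output can in principle be preserved by applying the transposed rotation to the input. Whether SDQ-LLM applies the rotation in exactly this way is an assumption here.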

Abstract

Large language models (LLMs) face significant computational and memory challenges, making extremely low-bit quantization crucial for their efficient deployment. In this work, we introduce SDQ-LLM: Sigma-Delta Quantization for 1-bit LLMs of any size, a novel framework that enables extremely low-bit quantization of LLMs while preserving their linguistic reasoning capabilities. A distinctive feature of SDQ-LLM is the continuous adjustability of the Over-Sampling Ratio (OSR): a fractional OSR (e.g., 2.5×) can be selected to match memory or VRAM constraints, giving an optimal trade-off between model size and accuracy. SDQ-LLM combines upsampling with a Sigma-Delta quantizer to binarize or ternarize LLM weights, encoding high-precision parameters into 1-bit or 1.58-bit representations and replacing the multiplication operations within linear layers with additions. This approach significantly enhances inference efficiency under extremely low-bit quantization. To further reduce the loss of quantization precision, we apply Hadamard-based weight smoothing prior to quantization, improving the stability and robustness of the weight representations. Furthermore, to fully exploit the continuity of the OSR, and recognizing the correlation between quantization sensitivity and weight variance, we propose MultiOSR, a fine-grained, layer- and linear-wise OSR allocation strategy. MultiOSR distributes OSR both across layers and within each layer, based on weight variance and parameter scale. Finally, extensive experiments on the OPT and LLaMA model families demonstrate that SDQ-LLM achieves more efficient and higher-precision performance, even under highly aggressive low-OSR settings. Our code is available at https://github.com/Dreamlittlecat/LLM-Quant-Factory.
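To make the upsample-then-quantize idea concrete, here is a minimal sketch of a first-order Sigma-Delta modulator applied to a single weight row with a fractional OSR of 2.5. It is an illustration under stated assumptions, not the paper's algorithm: linear interpolation is assumed for upsampling, the modulator is the textbook first-order error-feedback form, and the helper names `upsample` and `sigma_delta_1bit` are hypothetical.

```python
import numpy as np

def upsample(w, osr):
    """Linearly interpolate a 1-D vector to round(len(w) * osr) samples (assumed scheme)."""
    n_out = int(round(len(w) * osr))
    src = np.arange(len(w))
    dst = np.linspace(0, len(w) - 1, n_out)
    return np.interp(dst, src, w)

def sigma_delta_1bit(x):
    """First-order Sigma-Delta modulator; inputs must lie in [-1, 1]."""
    q = np.empty_like(x)
    err = 0.0                          # quantization error carried to the next sample
    for n, xn in enumerate(x):
        v = xn + err                   # add the error fed back from the previous step
        q[n] = 1.0 if v >= 0 else -1.0 # 1-bit decision
        err = v - q[n]                 # stays within [-1, 1] for inputs in [-1, 1]
    return q

rng = np.random.default_rng(0)
w = rng.normal(size=64)                # toy weight row
scale = np.abs(w).max()                # per-row scale so that inputs fit in [-1, 1]
up = upsample(w / scale, osr=2.5)      # 64 weights -> 160 samples
bits = sigma_delta_1bit(up)            # values in {-1, +1}

# Running sums of the 1-bit stream track those of the upsampled signal.
drift = np.max(np.abs(np.cumsum(up) - np.cumsum(bits)))
print(len(bits), drift)
```

The running-sum gap equals the carried error, which is why it never exceeds 1: the bitstream matches the signal on average even though each sample is only ±1. With ±1 weights, a dot product against activations reduces to adding and subtracting activation values, which is the sense in which multiplications are replaced by additions.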
