WaterSIC: information-theoretically (near) optimal linear layer quantization

Egor Lifar, Semyon Savkin, Or Ordentlich, Yury Polyanskiy

TL;DR

A novel algorithm, termed "WaterSIC", is proposed and shown to be within a rate gap of 0.255 bits of the information-theoretic (IT) limit, uniformly over all possible covariance matrices of input activations.

Abstract

This paper considers the problem of converting a given dense linear layer to low precision. The tradeoff between compressed length and output discrepancy is analyzed information-theoretically (IT). It is shown that the popular GPTQ algorithm may have an arbitrarily large gap to the IT limit. To alleviate this problem, a novel algorithm, termed "WaterSIC", is proposed and shown to be within a rate gap of 0.255 bits of the IT limit, uniformly over all possible covariance matrices of input activations. The key innovation of WaterSIC is to allocate different quantization rates to different columns (in-features) of the weight matrix, mimicking the classical IT solution known as "waterfilling". Applying WaterSIC to the Llama and Qwen families of LLMs establishes new state-of-the-art performance for all quantization rates from 1 to 4 bits.
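For readers unfamiliar with the classical "waterfilling" solution that the abstract refers to, the following is a minimal sketch of textbook reverse water-filling for independent Gaussian components under a total squared-error budget. It is not the WaterSIC algorithm itself; the function name `reverse_waterfill`, its parameters, and the example variances are all hypothetical and chosen only for illustration. The sketch assumes strictly positive variances and a positive distortion budget.

```python
import math

def reverse_waterfill(variances, total_distortion, iters=200):
    """Textbook reverse water-filling (illustrative, not WaterSIC).

    Finds a water level theta with sum_i min(theta, var_i) = total_distortion,
    then assigns rate 0.5 * log2(var_i / D_i) bits to component i, where
    D_i = min(theta, var_i). Assumes var_i > 0 and total_distortion > 0.
    """
    lo, hi = 0.0, max(variances)
    for _ in range(iters):
        theta = 0.5 * (lo + hi)
        # Total distortion grows with theta, so bisect on the water level.
        if sum(min(theta, v) for v in variances) > total_distortion:
            hi = theta
        else:
            lo = theta
    theta = 0.5 * (lo + hi)
    distortions = [min(theta, v) for v in variances]
    rates = [0.5 * math.log2(v / d) for v, d in zip(variances, distortions)]
    return theta, distortions, rates

# Hypothetical example: three components with unequal variances.
theta, distortions, rates = reverse_waterfill([4.0, 1.0, 0.25], 1.5)
print(theta, distortions, rates)
```

In this example the component whose variance lies below the water level receives zero rate, while the higher-variance components receive more bits. This unequal allocation is the behavior that per-column rate assignment is said to mimic.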

Paper Structure (6 sections, 3 theorems, 21 equations, 3 figures)

Key Result

Proposition 3.1

Figures (3)

  • Figure 1: Llama-3.2-1B: WaterSIC vs other algorithms. WaterSIC and Huffman-GPTQ use entropy to report rates, others use log-cardinality.
  • Figure 2: Qwen3-8B: WaterSIC vs other algorithms. WaterSIC, Huffman-GPTQ and Huffman-RTN use entropy to report rates, others use log-cardinality.
  • Figure 3: WikiText-2 bits-per-byte (BPB) vs. compressed model size (GiB) for WaterSIC across multiple base models. Dashed lines connect the same model compressed with various bit rates.

Theorems & Definitions (3)

  • Proposition 3.1
  • Lemma 3.2
  • Theorem 3.3