NanoQuant: Efficient Sub-1-Bit Quantization of Large Language Models

Hyochan Chong, Dongkyu Kim, Changdong Kim, Minseop Choi

TL;DR

NanoQuant tackles the memory cost of deploying very large LLMs with a post-training quantization (PTQ) method that reaches 1-bit and sub-1-bit weight representations through a low-rank binary factorization, initialized with Hessian-aware Latent Binary ADMM (LB-ADMM) and tuned by a block-wise reconstruction pipeline. The pipeline combines precise initialization, error propagation mitigation, and local refinement, followed by a global model reconstruction step based on KL alignment, and achieves state-of-the-art PTQ accuracy at extreme compression. Empirically, NanoQuant compresses models such as Llama-2-70B by up to $25.8\times$, which allows a 70B model to run on an 8 GB GPU, and it improves decoding throughput and energy efficiency on consumer hardware as well as in the datacenter. The work also provides custom binary CUDA kernels to accelerate inference, offering a practical path to memory-efficient LLM deployment.
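
As a rough sanity check on these figures, the arithmetic below converts a $25.8\times$ compression ratio into an average bit width and a weight footprint. It assumes a 16-bit (BF16) baseline, a nominal 70 billion parameters, and uniform compression across all of them; these are simplifying assumptions, and the helper names are illustrative.

```python
# Back-of-the-envelope check. Assumptions: 16-bit baseline, nominal 70e9
# parameters, and the same compression rate applied to every parameter.

def bits_per_weight(compression_ratio: float, baseline_bits: float = 16.0) -> float:
    """Average bits per weight implied by a compression ratio."""
    return baseline_bits / compression_ratio


def footprint_gb(num_params: float, bpw: float) -> float:
    """Weight storage in gigabytes (1 GB = 1e9 bytes)."""
    return num_params * bpw / 8 / 1e9


bpw = bits_per_weight(25.8)
size = footprint_gb(70e9, bpw)
print(f"{bpw:.2f} bits/weight, {size:.1f} GB")
```

Under these assumptions the weights come to roughly 0.62 bits each and about 5.4 GB in total, which is consistent with fitting a 70B model on an 8 GB GPU before activations and the KV cache are counted.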

Abstract

Weight-only quantization has become a standard approach for efficiently serving large language models (LLMs). However, existing methods fail to efficiently compress models to binary (1-bit) levels, as they either require large amounts of data and compute or incur additional storage. In this work, we propose NanoQuant, the first post-training quantization (PTQ) method to compress LLMs to both binary and sub-1-bit levels. NanoQuant formulates quantization as a low-rank binary factorization problem and compresses full-precision weights into low-rank binary matrices and scales. Specifically, it uses an efficient alternating direction method of multipliers (ADMM) to precisely initialize latent binary matrices and scales, and then tunes the initialized parameters through a block and model reconstruction process. Consequently, NanoQuant establishes a new Pareto frontier in low-memory post-training quantization, achieving state-of-the-art accuracy even at sub-1-bit compression rates. NanoQuant makes large-scale deployment feasible on consumer hardware. For example, it compresses Llama-2-70B by $25.8\times$ in just 13 hours on a single H100, enabling a 70B model to operate on a consumer 8 GB GPU.
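
A minimal sketch of what a low-rank binary factorization stores and how a dense weight could be rebuilt from it. The layout $\mathbf{W} \approx \mathrm{diag}(\mathbf{s}_1)\,\mathbf{U}\mathbf{V}^{\top}\,\mathrm{diag}(\mathbf{s}_2)$ with $\mathbf{U}, \mathbf{V}$ in $\{-1, +1\}$, the 16-bit scale width, and the function names are assumptions made for illustration, not the paper's implementation.

```python
# Illustrative only. Assumed layout: W_hat[i][j] = s1[i] * s2[j] * sum_k U[i][k] * V[j][k],
# where U is n x r and V is m x r with entries in {-1, +1}.

def reconstruct(U, V, s1, s2):
    """Rebuild a dense n x m matrix from binary factors and per-row/column scales."""
    rank = len(U[0])
    return [
        [s1[i] * s2[j] * sum(U[i][k] * V[j][k] for k in range(rank))
         for j in range(len(V))]
        for i in range(len(U))
    ]


def factorized_bits_per_weight(n, m, rank, scale_bits=16):
    """Storage per original weight: 1 bit per binary entry plus the scale vectors."""
    binary_bits = rank * (n + m)
    scale_storage = scale_bits * (n + m)
    return (binary_bits + scale_storage) / (n * m)


U = [[1, -1], [1, 1], [-1, 1]]   # n = 3, rank = 2
V = [[1, 1], [-1, 1]]            # m = 2, rank = 2
W_hat = reconstruct(U, V, s1=[0.5, 1.0, 2.0], s2=[1.0, 0.25])

# A 4096 x 4096 layer at rank 1024 under the assumed layout.
bpw_example = factorized_bits_per_weight(4096, 4096, 1024)
```

In this layout the bit width drops below 1 whenever the rank is small enough relative to the layer dimensions; the 4096 x 4096 example at rank 1024 costs about 0.51 bits per weight.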

Paper Structure (73 sections, 5 theorems, 39 equations, 10 figures, 13 tables, 1 algorithm)

Key Result

Proposition 1

Let the target weight matrix $\mathbf{W}$ have the singular value decomposition (SVD) $\mathbf{W} = \mathbf{L}\mathbf{\Sigma}\mathbf{R}^{\top}$. We analyze how the spectral energy is distributed between the factors $\mathcal{U}$ and $\mathcal{V}$ by introducing a parameter $\gamma \in [0,1]$. The condition that balances the magnitude distribution and avoids numerical extremes is $\|\mathcal{U}\|_F = \|\mathcal{V}\|_F$.
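
The proposition's definition of the factors is not reproduced in this summary. As a hedged illustration, suppose the split is $\mathcal{U} = \mathbf{L}\mathbf{\Sigma}^{\gamma}$ and $\mathcal{V} = \mathbf{R}\mathbf{\Sigma}^{1-\gamma}$ (an assumption, not taken from the text above). Because $\mathbf{L}$ and $\mathbf{R}$ have orthonormal columns, the Frobenius norms then depend only on the singular values, and the sketch below compares them for a few values of $\gamma$.

```python
import math

# Assumed parameterization (not stated in the summary):
#   U = L @ Sigma**gamma,  V = R @ Sigma**(1 - gamma)
# With orthonormal L and R: ||U||_F^2 = sum(sigma_i ** (2 * gamma)) and
# ||V||_F^2 = sum(sigma_i ** (2 * (1 - gamma))).

def factor_norms(singular_values, gamma):
    """Frobenius norms of the two factors under the assumed split."""
    norm_u = math.sqrt(sum(s ** (2 * gamma) for s in singular_values))
    norm_v = math.sqrt(sum(s ** (2 * (1 - gamma)) for s in singular_values))
    return norm_u, norm_v


sigma = [9.0, 4.0, 1.0]  # hypothetical singular values
for gamma in (0.0, 0.5, 1.0):
    norm_u, norm_v = factor_norms(sigma, gamma)
    print(f"gamma={gamma}: ||U||_F={norm_u:.3f}, ||V||_F={norm_v:.3f}")
```

Under this assumed split the two norms coincide at $\gamma = 1/2$, while $\gamma = 0$ or $\gamma = 1$ pushes all of the spectral energy into one factor.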

Figures (10)

  • Figure 1: Perplexity comparison on WikiText-2. NanoQuant achieves state-of-the-art results among post-training quantization (PTQ) methods and is the only framework effectively enabling sub-1-bit compression while outperforming existing binary baselines.
  • Figure 2: Illustration of the NanoQuant compression scheme. The process proceeds in three stages: (a) Factorization, where the weight matrix is decomposed into continuous latent factors ($\mathbf{U}_\text{FP}, \mathbf{V}_\text{FP}$) and floating-point scales ($\mathbf{s}_1, \mathbf{s}_2$) which are fine-tuned to minimize reconstruction error; (b) Binarization, where these optimized factors are quantized into binary matrices ($\mathbf{U}_{\pm 1}, \mathbf{V}_{\pm 1}$) containing $\{-1, +1\}$ values; and (c) Packing, where these values are mapped to bits ($-1 \to 0, +1 \to 1$) and efficiently packed into integer formats (e.g., 8-bit blocks) for memory efficiency.
  • Figure 3: The NanoQuant block reconstruction pipeline for compressing linear layers. The process sequentially optimizes each transformer block through three key phases: (1) Error Propagation Mitigation to adjust full-precision weights for accumulated errors; (2) Low-Rank Binary Initialization, which utilizes Latent Binary ADMM (LB-ADMM) to precisely generate latent binary factors and scales; and (3) Factorized Component Refinement, which fine-tunes the continuous latent matrices and scales using Straight-Through Estimators (STE) before final packing.
  • Figure 4: On a single NVIDIA RTX 3050 (8GB), NanoQuant delivers up to $3.6\times$ higher decoding throughput, $5.4\times$ lower peak memory usage, and $3.9\times$ greater energy efficiency compared to BF16 baselines for Llama-3.2-1B and 3B models.
  • Figure 5: Datacenter inference efficiency on a single NVIDIA H100 (80GB). NanoQuant enables faster decoding throughput while maintaining superior memory and energy efficiency for Llama-2-13B and Qwen-3-32B, compared to the PyTorch BF16 baseline.
  • ...and 5 more figures
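
The packing stage described in the Figure 2 caption ($-1 \to 0$, $+1 \to 1$, stored in 8-bit blocks) can be sketched as follows. The least-significant-bit-first ordering within each byte and the function names are assumptions for illustration; the paper's CUDA kernels may use a different layout.

```python
# Illustrative bit packing of {-1, +1} values, 8 signs per byte.
# Assumed bit order: element i goes to bit (i % 8) of byte (i // 8).

def pack_signs(signs):
    """Map -1 -> 0 and +1 -> 1, then pack the bits into bytes."""
    packed = bytearray((len(signs) + 7) // 8)
    for idx, sign in enumerate(signs):
        if sign not in (-1, 1):
            raise ValueError(f"expected -1 or +1, got {sign!r}")
        if sign == 1:
            packed[idx // 8] |= 1 << (idx % 8)
    return bytes(packed)


def unpack_signs(packed, count):
    """Recover the first `count` signs from the packed bytes."""
    return [1 if (packed[i // 8] >> (i % 8)) & 1 else -1 for i in range(count)]


signs = [1, -1, -1, 1, 1, 1, -1, -1, 1, -1]
packed = pack_signs(signs)
restored = unpack_signs(packed, len(signs))
```

Ten signs occupy two bytes here, compared with 20 bytes if each value were kept as a 16-bit number; the element count has to be stored separately because the last byte is zero-padded.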

Theorems & Definitions (11)

  • Proposition 1: Optimality of Balanced Scales
  • proof
  • Remark 1: Numerical Stability via Normalization
  • proof: Analysis of Conditioning
  • Lemma 1: Uniform Bound Induced by Percentile Clipping
  • proof
  • Corollary 2: Spectral Control of the Preconditioned Target
  • Lemma 2: SPD Structure and Uniqueness
  • proof
  • Theorem 3: Monotonic Descent of Augmented Lagrangian
  • ...and 1 more