LittleBit: Ultra Low-Bit Quantization via Latent Factorization

Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim

TL;DR

The paper tackles the challenge of deploying large language models on resource-constrained devices by pushing quantization into the extreme sub-1-bit regime. It introduces LittleBit, which combines latent low-rank factorization with binarized factors and a multi-scale compensation mechanism, augmented by Dual-SVID initialization and Residual Compensation. Through extensive experiments across models from 1.3B to 32B parameters, LittleBit achieves unprecedented effective-bit levels (as low as 0.1 BPW) while maintaining competitive perplexities and zero-shot reasoning, outperforming prior sub-1-bit approaches. The approach yields substantial memory reductions (up to ~70x for some scales) and kernel-level speedups (up to ~11.6x), broadening the practical deployment of capable LLMs on edge devices. The work also provides deep analyses of memory, KV cache, and latency, and discusses practical considerations and future hardware-aware optimizations.

Abstract

Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from this extreme precision, it integrates a multi-scale compensation mechanism. This includes row, column, and an additional latent dimension that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method's 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off--unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level--and makes powerful LLMs practical for resource-constrained environments. Our code can be found at https://github.com/SamsungLabs/LittleBit.
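To make the sub-1-bit idea concrete, the sketch below estimates the average number of stored bits per original weight when a layer is replaced by binarized low-rank factors plus FP16 scale vectors. The accounting is an assumption for illustration, not the paper's exact formula: each path stores two 1-bit sign factors and three FP16 scale vectors, and a second (residual) path doubles the cost. The helper names `effective_bpw` and `fp16_size_gb` are hypothetical. The last lines check the abstract's figures, taking "13B" as a nominal 13e9 parameters.

```python
def effective_bpw(d_out: int, d_in: int, rank: int,
                  scale_bits: int = 16, num_paths: int = 2) -> float:
    """Average stored bits per original weight for one factorized linear layer.

    Assumed accounting (illustrative, not taken from the paper): each path
    stores two 1-bit sign factors of shapes (d_out, rank) and (d_in, rank),
    plus FP16 scale vectors of lengths d_out, d_in and rank.
    """
    sign_bits = rank * (d_out + d_in)                       # 1 bit per factor entry
    scale_bits_total = scale_bits * (d_out + d_in + rank)   # row, column, latent scales
    return num_paths * (sign_bits + scale_bits_total) / (d_out * d_in)


def fp16_size_gb(num_params: float) -> float:
    """Size of an FP16 model in GB (10^9 bytes), at 2 bytes per parameter."""
    return num_params * 2 / 1e9


if __name__ == "__main__":
    # A smaller rank gives fewer bits per weight for a 4096 x 4096 layer.
    for rank in (64, 256, 1024):
        print(rank, round(effective_bpw(4096, 4096, rank), 3))

    # Sanity check of the abstract: FP16 size divided by the quoted ~31x reduction.
    print(fp16_size_gb(13e9) / 31)
```

Under this accounting the bit budget is controlled almost entirely by the rank, which is why the latent dimension is the natural knob for targeting a given BPW.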

Paper Structure

This paper contains 43 sections, 2 theorems, 16 equations, 8 figures, 12 tables.

Key Result

Proposition 1

As stated in sec:little_bit_arch (Proposition 1), for input $\mathbf{X} \in \mathbb{R}^{\mathrm{seq} \times d_{\mathrm{in}}}$, the primary quantized weight matrix is $\widehat{\mathbf{W}}_{\mathrm{pri}} = \mathrm{diag}(\mathbf{h}) \mathbf{U}_{\mathrm{sign}} \mathrm{diag}(\bm{\ell}) \mathbf{V}_{\mathrm{sign}}^{\top} \mathrm{diag}(\mathbf{g})$ … Terms are defined in sec:little_bit_arch.
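The following NumPy sketch illustrates how a forward pass can avoid materializing the effective weight. It rests on two assumptions: the weight has the form $\mathrm{diag}(\mathbf{h}) \mathbf{U}_{\mathrm{sign}} \mathrm{diag}(\bm{\ell}) \mathbf{V}_{\mathrm{sign}}^{\top} \mathrm{diag}(\mathbf{g})$, using the factor and scale names from the Figure 2 caption, and the layer computes $\mathbf{Y} = \mathbf{X} \widehat{\mathbf{W}}^{\top}$. The function names are hypothetical, and the sign factors are stored as floats for simplicity rather than as packed bits.

```python
import numpy as np


def path_forward(X, U_sign, V_sign, h, g, ell):
    """Apply one factorized path without forming the (d_out, d_in) weight."""
    Z = (X * g) @ V_sign        # column scale, then project to the latent rank: (seq, r)
    Z = Z * ell                 # per-rank (latent) scale
    return (Z @ U_sign.T) * h   # expand to d_out, then row scale: (seq, d_out)


def dense_weight(U_sign, V_sign, h, g, ell):
    """Materialize diag(h) U_sign diag(ell) V_sign^T diag(g); for checking only."""
    return (h[:, None] * U_sign * ell) @ (V_sign.T * g)


def random_path(rng, d_out, d_in, r):
    """Random +/-1 factors and positive scales, standing in for trained parameters."""
    U_sign = rng.choice([-1.0, 1.0], size=(d_out, r))
    V_sign = rng.choice([-1.0, 1.0], size=(d_in, r))
    return U_sign, V_sign, rng.random(d_out), rng.random(d_in), rng.random(r)


def two_path_forward(X, primary, residual):
    """Sum of the primary and residual paths, as described for Figure 2."""
    return path_forward(X, *primary) + path_forward(X, *residual)


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    seq, d_in, d_out, r = 4, 32, 48, 8
    X = rng.standard_normal((seq, d_in))
    primary = random_path(rng, d_out, d_in, r)
    residual = random_path(rng, d_out, d_in, r)

    Y = two_path_forward(X, primary, residual)
    W = dense_weight(*primary) + dense_weight(*residual)
    print(np.allclose(Y, X @ W.T))
```

The factorized path costs on the order of seq * r * (d_in + d_out) multiply-adds instead of seq * d_in * d_out, so a small latent rank reduces compute as well as storage.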

Figures (8)

  • Figure 1: Low-bit quantization perplexity on Llama2-13B (WikiText-2). LittleBit surpasses the state-of-the-art sub-1-bit quantization technique. Below 0.5 BPW, where the leading prior method degrades sharply, ours remains robust down to 0.1 BPW.
  • Figure 2: Comparison of a standard Transformer layer (left) and the LittleBit architecture (right). LittleBit performs linear transformation using parallel Primary and Residual pathways. The Primary path employs binarized factors ($\mathbf{U}_\mathrm{sign}, \mathbf{V}_\mathrm{sign}$) and FP16 scales ($\mathbf{h}, \mathbf{g}, \bm{\ell}$) on input $\mathbf{X}$, initialized from $\mathbf{W}$ via Dual-SVID. Simultaneously, the Residual path computes a correction with its own parameters ($\mathbf{U}_{\mathrm{res, sign}}, \mathbf{V}_{\mathrm{res, sign}}, \mathbf{h}_\mathrm{res}, \mathbf{g}_\mathrm{res}, \bm{\ell}_\mathrm{res}$) from the approximation residual. Their outputs sum to form $\mathbf{Y}$, eliminating storage of the effective weight matrices $\widehat{\mathbf{W}}_\mathrm{pri}$ and $\widehat{\mathbf{W}}_\mathrm{res}$.
  • Figure 3: Visualization of Dual-SVID initialized weight components for a selected layer in Llama2-7B (Query Weight, Layer 0). Columns, from left to right, represent effective bits of 0.1, 0.3, 0.55, 0.7, 0.8, and 1.0 BPW. Rows display the primary approximation ($\widehat{\mathbf{W}}_{\mathrm{pri},0}$), the residual approximation ($\widehat{\mathbf{W}}_{\mathrm{res},0}$), and their sum ($\widehat{\mathbf{W}}_{0} = \widehat{\mathbf{W}}_{\mathrm{pri},0} + \widehat{\mathbf{W}}_{\mathrm{res},0}$). The rightmost image shows the corresponding crop of the original weight matrix ($\mathbf{W}$) for reference.
  • Figure 4: Zero-shot accuracy (%) on 7 common sense reasoning tasks for the Phi-4 14B model. Compares LittleBit-compressed Phi-4 at 0.55, 0.3, and 0.1 BPW against STBLLM.
  • Figure 5: Conceptual view of KV cache storage: the standard method (left) stores the full hidden dimension ($d_\mathrm{model}$), whereas LittleBit (right) caches a reduced latent dimension ($r$).
  • ...and 3 more figures

Theorems & Definitions (8)

  • Proposition 1: LittleBit Forward Pass Computation (Proof Detail)
  • Proof
  • Claim 1: Quantization Error vs. Factor Rank
  • Proof (Sketch / Heuristic Argument)
  • Claim 2: Quantization Bias vs. SVD Component Structure
  • Proof (Sketch)
  • Proposition 2: Potential Advantage of Two-Stage Quantization of SVD Components
  • Proof (Outline and Discussion)