LittleBit: Ultra Low-Bit Quantization via Latent Factorization
Banseok Lee, Dongkyu Kim, Youngcheon You, Youngmin Kim
TL;DR
The paper tackles the challenge of deploying large language models on resource-constrained devices by pushing quantization into the sub-1-bit regime. It introduces LittleBit, which combines latent low-rank factorization with binarized factors and a multi-scale compensation mechanism, supported by Dual-SVID initialization and Residual Compensation. In experiments on models from 1.3B to 32B parameters, LittleBit reaches effective bit widths as low as 0.1 bits per weight (BPW) while maintaining competitive perplexity and zero-shot reasoning accuracy, outperforming prior sub-1-bit approaches. The approach yields substantial memory reductions (up to ~70x for some scales) and kernel-level speedups (up to ~11.6x over FP16), broadening the practical deployment of capable LLMs on edge devices. The work also provides detailed analyses of memory, KV cache, and latency, and discusses practical considerations and future hardware-aware optimizations.
Abstract
Deploying large language models (LLMs) often faces challenges from substantial memory and computational costs. Quantization offers a solution, yet performance degradation in the sub-1-bit regime remains particularly difficult. This paper introduces LittleBit, a novel method for extreme LLM compression. It targets levels like 0.1 bits per weight (BPW), achieving nearly 31$\times$ memory reduction, e.g., Llama2-13B to under 0.9 GB. LittleBit represents weights in a low-rank form using latent matrix factorization, subsequently binarizing these factors. To counteract information loss from such extremely low precision, it integrates a multi-scale compensation mechanism. This includes row and column scales and an additional latent-dimension scale that learns per-rank importance. Two key contributions enable effective training: Dual Sign-Value-Independent Decomposition (Dual-SVID) for quantization-aware training (QAT) initialization, and integrated Residual Compensation to mitigate errors. Extensive experiments confirm LittleBit's superiority in sub-1-bit quantization: e.g., its 0.1 BPW performance on Llama2-7B surpasses the leading method's 0.7 BPW. LittleBit establishes a new, viable size-performance trade-off, unlocking a potential 11.6$\times$ speedup over FP16 at the kernel level, and makes powerful LLMs practical for resource-constrained environments. Our code can be found at https://github.com/SamsungLabs/LittleBit.
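To make the structure described in the abstract concrete, the following is a minimal NumPy sketch of a layer built from two binarized low-rank factors with row, column, and per-rank (latent) scales. It is an illustration under stated assumptions, not the authors' implementation: the function names, the exact ordering of the scales, the random placeholder values, and the 16-bit storage of scale vectors in the bits-per-weight estimate are all hypothetical. Dual-SVID initialization and Residual Compensation are not shown.

```python
import numpy as np


def littlebit_style_forward(x, U_sign, V_sign, h, g, l):
    """Apply W_hat = diag(h) @ U_sign @ diag(l) @ V_sign.T @ diag(g) to the
    rows of x without materializing the dense matrix W_hat.

    Assumed shapes (hypothetical names): x (batch, d_in), U_sign (d_out, r),
    V_sign (d_in, r), h (d_out,), g (d_in,), l (r,).
    """
    z = (x * g) @ V_sign        # column scale, then project to latent space: (batch, r)
    z = z * l                   # per-rank latent scale
    return (z @ U_sign.T) * h   # expand to output space, then row scale


def dense_equivalent(U_sign, V_sign, h, g, l):
    """Materialize the dense (d_out, d_in) matrix implied by the factors."""
    return (h[:, None] * U_sign * l) @ (V_sign.T * g)


def effective_bpw(d_out, d_in, rank, scale_bits=16):
    """Rough bits per original weight: 1 bit per sign entry plus scale vectors.

    scale_bits=16 is an assumption (FP16 scales), not a figure from the paper.
    """
    sign_bits = rank * (d_out + d_in)
    scale_storage = scale_bits * (d_out + d_in + rank)
    return (sign_bits + scale_storage) / (d_out * d_in)


# Toy example with random placeholder factors (no training or initialization).
rng = np.random.default_rng(0)
d_out, d_in, r, batch = 8, 6, 3, 4
U_sign = rng.choice([-1.0, 1.0], size=(d_out, r))   # binarized left factor
V_sign = rng.choice([-1.0, 1.0], size=(d_in, r))    # binarized right factor
h = rng.uniform(0.5, 1.5, size=d_out)               # row scales
g = rng.uniform(0.5, 1.5, size=d_in)                # column scales
l = rng.uniform(0.5, 1.5, size=r)                   # latent (per-rank) scales
x = rng.standard_normal((batch, d_in))

y = littlebit_style_forward(x, U_sign, V_sign, h, g, l)
W_hat = dense_equivalent(U_sign, V_sign, h, g, l)

# The factored path and the dense path agree up to floating-point error.
assert np.allclose(y, x @ W_hat.T)

# Under the stated storage assumptions, a 4096 x 4096 layer at rank 128
# lands well below 1 bit per weight.
print(effective_bpw(4096, 4096, 128))
```

In this sketch the rank controls the bit budget: the sign matrices cost `rank * (d_out + d_in)` bits in total, so shrinking the rank relative to the layer dimensions is what pushes the estimate below 1 BPW, while the three scale vectors add only a small overhead.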
