Atom: Low-bit Quantization for Efficient and Accurate LLM Serving
Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci
TL;DR
Atom tackles the throughput bottleneck in LLM serving by introducing a low-bit weight-activation quantization framework that exploits modern hardware. It combines mixed-precision with channel reordering, fine-grained group quantization, dynamic activation quantization, and KV-cache quantization, all fused into end-to-end serving workflows. Through comprehensive evaluation on Llama models, Atom achieves up to approximately 7.7× end-to-end throughput gains over FP16 and about 2.5× over INT8, while incurring minimal accuracy loss. The work demonstrates practical, hardware-aware quantization design and integration that substantially improves serving efficiency without sacrificing model quality.
Abstract
The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization in the serving context. Atom improves end-to-end throughput (token/s) by up to $7.7\times$ compared to the FP16 and by $2.5\times$ compared to INT8 quantization, while maintaining the same latency target.
