Atom: Low-bit Quantization for Efficient and Accurate LLM Serving

Yilong Zhao, Chien-Yu Lin, Kan Zhu, Zihao Ye, Lequn Chen, Size Zheng, Luis Ceze, Arvind Krishnamurthy, Tianqi Chen, Baris Kasikci

TL;DR

Atom tackles the throughput bottleneck in LLM serving by introducing a low-bit weight-activation quantization framework that exploits modern hardware. It combines mixed-precision with channel reordering, fine-grained group quantization, dynamic activation quantization, and KV-cache quantization, all fused into end-to-end serving workflows. Through comprehensive evaluation on Llama models, Atom achieves up to approximately 7.7× end-to-end throughput gains over FP16 and about 2.5× over INT8, while incurring minimal accuracy loss. The work demonstrates practical, hardware-aware quantization design and integration that substantially improves serving efficiency without sacrificing model quality.

Abstract

The growing demand for Large Language Models (LLMs) in applications such as content generation, intelligent chatbots, and sentiment analysis poses considerable challenges for LLM service providers. To efficiently use GPU resources and boost throughput, batching multiple requests has emerged as a popular paradigm; to further speed up batching, LLM quantization techniques reduce memory consumption and increase computing capacity. However, prevalent quantization schemes (e.g., 8-bit weight-activation quantization) cannot fully leverage the capabilities of modern GPUs, such as 4-bit integer operators, resulting in sub-optimal performance. To maximize LLMs' serving throughput, we introduce Atom, a low-bit quantization method that achieves high throughput improvements with negligible accuracy loss. Atom significantly boosts serving throughput by using low-bit operators and considerably reduces memory consumption via low-bit quantization. It attains high accuracy by applying a novel mixed-precision and fine-grained quantization process. We evaluate Atom on 4-bit weight-activation quantization in the serving context. Atom improves end-to-end throughput (tokens/s) by up to $7.7\times$ compared to FP16 and by $2.5\times$ compared to INT8 quantization, while maintaining the same latency target.
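
The fine-grained quantization mentioned in the abstract can be illustrated with a small sketch. The code below is not Atom's implementation; it is a minimal pure-Python example of symmetric group quantization, assuming signed 4-bit integers and a toy group size of 4. The function names (`quantize_group`, `quantize_groupwise`, `dequantize_groupwise`, `max_abs_error`) and the sample values are hypothetical and chosen only for illustration.

```python
def quantize_group(values, bits=4):
    """Symmetrically quantize one group to signed `bits`-bit integers.

    Returns the integer codes and the per-group scale (an illustrative
    scheme, not necessarily the one used by Atom).
    """
    qmax = 2 ** (bits - 1) - 1                      # 7 for 4-bit
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0  # avoid division by zero
    codes = [max(-qmax - 1, min(qmax, round(v / scale))) for v in values]
    return codes, scale


def quantize_groupwise(row, group_size=4, bits=4):
    """Split a row into contiguous groups and quantize each with its own scale."""
    return [
        quantize_group(row[start:start + group_size], bits)
        for start in range(0, len(row), group_size)
    ]


def dequantize_groupwise(groups):
    """Map integer codes back to floating point using each group's scale."""
    out = []
    for codes, scale in groups:
        out.extend(code * scale for code in codes)
    return out


def max_abs_error(reference, approx):
    """Largest element-wise absolute difference between two sequences."""
    return max(abs(x - y) for x, y in zip(reference, approx))


# Hypothetical row: four small values followed by four large ones.
row = [0.1, -0.2, 0.3, 0.05, 8.0, -6.0, 7.5, 4.0]

fine = dequantize_groupwise(quantize_groupwise(row, group_size=4))
coarse = dequantize_groupwise(quantize_groupwise(row, group_size=8))

# Error on the small values only: a per-group scale preserves them,
# while a single shared scale rounds them all to zero.
fine_err = max_abs_error(row[:4], fine[:4])
coarse_err = max_abs_error(row[:4], coarse[:4])
print(fine_err < coarse_err)
```

With one scale per small group, the large values in the second half of the row no longer dictate the step size used for the small values in the first half, which is the intuition behind using finer quantization granularity.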

Paper Structure (21 sections, 5 equations, 11 figures, 4 tables)

Figures (11)

  • Figure 1: Overview of Atom's design. For activation matrices, we dynamically reorder the channels to pick out the outliers. Then, we apply low-bit group quantization to the normal values while using high-bit precision for outliers. For weight matrices, the quantization process can be done statically. We perform fused GEMM and fused FlashInfer [flashinfer] to boost throughput. We also adopt a quantized KV-cache to reduce memory movement.
  • Figure 2: WikiText2 perplexity on Llama models with different 4-bit weight-activation quantization mechanisms. Atom maintains perplexity results close to the FP16 baseline across all model sizes.
  • Figure 3: Runtime breakdown of Llama-7b inference with different batch sizes. The dense layer represents the batched K, Q, V generation, O projection, and MLP. The self-attention layer is implemented by FlashInfer [flashinfer] integrated with PagedAttention [vllm]. Results indicate that the dense and self-attention layers together account for over 90% of the execution time, thereby constraining the throughput.
  • Figure 4: A roofline model of different quantization approaches that characterizes operators by their arithmetic intensity, which is defined as $\text{Ops}/\text{Elements}$. At large batch sizes, the dense layer has a high arithmetic intensity and is compute-bound, whereas self-attention consistently exhibits a lower arithmetic intensity.
  • Figure 5: Sampled value of an activation matrix from Llama-7b. (a) The activation matrix contains outlier channels, which result in large quantization errors. (b) Atom reorders these outlier channels to the end of the matrix and uses higher precision to quantize them while keeping regular memory access.
  • ...and 6 more figures
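
The channel reordering and mixed-precision idea described in the captions of Figures 1 and 5 can be sketched as follows. This is a simplified pure-Python illustration, not Atom's kernels: it assumes one outlier channel, 4-bit precision for normal channels and 8-bit for outliers, and it simulates quantization by rounding and rescaling in floating point. All function names and the sample data are hypothetical.

```python
def channel_order(calib_rows):
    """Return a channel permutation sorted by peak magnitude, largest last.

    `calib_rows` is a hypothetical calibration sample of activations
    (a list of rows, one value per channel).
    """
    num_channels = len(calib_rows[0])
    peak = [max(abs(r[c]) for r in calib_rows) for c in range(num_channels)]
    return sorted(range(num_channels), key=lambda c: peak[c])


def reorder(row, order):
    """Apply the channel permutation to one activation row."""
    return [row[c] for c in order]


def fake_quantize(values, bits):
    """Symmetric quantize-then-dequantize, used here to measure error."""
    qmax = 2 ** (bits - 1) - 1
    max_abs = max(abs(v) for v in values)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    return [max(-qmax - 1, min(qmax, round(v / scale))) * scale for v in values]


def mixed_precision_quantize(row, order, num_outliers, low_bits=4, high_bits=8):
    """Quantize normal channels at low precision and trailing outliers at high precision."""
    reordered = reorder(row, order)
    split = len(reordered) - num_outliers
    return (fake_quantize(reordered[:split], low_bits)
            + fake_quantize(reordered[split:], high_bits))


# Hypothetical calibration data: channel 1 is an outlier channel.
calib = [[0.2, 9.0, -0.1, 0.3],
         [-0.3, -12.0, 0.2, 0.1]]
order = channel_order(calib)

row = [0.25, 10.0, -0.15, 0.2]
reordered = reorder(row, order)
mixed = mixed_precision_quantize(row, order, num_outliers=1)
uniform = fake_quantize(reordered, bits=4)

mixed_err = max(abs(a - b) for a, b in zip(reordered, mixed))
uniform_err = max(abs(a - b) for a, b in zip(reordered, uniform))
print(order, mixed_err < uniform_err)
```

Because the outliers end up in a contiguous block at the end of each row, the low-precision and high-precision parts can each be processed as a regular slice. In a real matrix multiplication, the weight matrix would need the matching permutation applied along the same dimension so that the product is unchanged.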