M-ANT: Efficient Low-bit Group Quantization for LLMs via Mathematically Adaptive Numerical Type

Weiming Hu, Haoyan Zhang, Cong Guo, Yu Feng, Renyang Guan, Zhendong Hua, Zihan Liu, Yue Guan, Minyi Guo, Jingwen Leng

TL;DR

The paper tackles memory and compute bottlenecks in LLM inference by proposing MANT, a mathematically adaptive numerical type for group-wise quantization. MANT combines a flexible mapping governed by a per-group coefficient $a$, offline weight encoding, and a real-time KV-cache quantization engine, and it fuses decoding with computation in a systolic-array accelerator. Key contributions include the Value_grid mapping, variance-driven selection of $a$, offline weight encoding, real-time KV quantization, and a mixed-precision PE design. Together these yield an average speedup of $2.99\times$ and an average energy reduction of $2.81\times$ over state-of-the-art accelerators. The work shows that per-group adaptivity and real-time KV handling are essential for scalable, efficient LLM inference.

Abstract

Large language models (LLMs) are among the most important killer applications in computing. Recent algorithmic advances propose fine-grained group-wise quantization for LLMs, which treats a small set (e.g., 64) of values in a tensor as a compression unit. It effectively preserves model accuracy without retraining and has become the standard approach to deploying LLMs efficiently. On the other hand, other works propose various adaptive data types to better fit different distributions and further reduce the required bit length for LLMs. In this work, our detailed analysis unveils a key finding: while different tensors exhibit similar distributions, small groups can have markedly different distributions. As such, group-level diversity requires a new level of adaptivity that existing adaptive data types fail to provide. In this paper, we propose MANT, a mathematically adaptive numeric type, featuring a more flexible encoding paradigm that covers a wider range of data distributions and a more efficient decoding-computation fusion mechanism to address these challenges. Based on MANT, we develop a supporting framework to assign the appropriate data type to each group adaptively. Meanwhile, the dynamically generated Key-Value (KV) caches in LLMs introduce further complexity for real-time quantization. To tackle this, we propose an efficient real-time quantization mechanism. In addition, we implement a dedicated processing element (PE) to efficiently support MANT and incorporate a real-time quantization unit. By integrating these components into a systolic array, MANT unifies group-wise weight and KV cache quantization and addresses the associated challenges. Our evaluation shows that MANT achieves, on average, a 2.99x (up to 4.46x) speedup and a 2.81x (up to 4.10x) energy reduction over the state-of-the-art LLM accelerator.
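
For readers unfamiliar with the group-wise baseline the abstract refers to, the following is a minimal sketch of plain symmetric INT4 quantization with one scale per group of 64 values. It is not MANT itself: it uses a fixed integer grid with no per-group coefficient $a$, and the function names `quantize_groupwise` and `dequantize_groupwise`, the absmax scaling rule, and the random test tensor are all assumptions made for illustration.

```python
import numpy as np

def quantize_groupwise(x, group_size=64, bits=4):
    """Symmetric round-to-nearest quantization with one scale per group.

    Illustrative baseline only (hypothetical helper, not the paper's method).
    """
    qmax = 2 ** (bits - 1) - 1              # 7 for 4-bit symmetric codes
    groups = x.reshape(-1, group_size)      # each row is one compression unit
    absmax = np.abs(groups).max(axis=1, keepdims=True)
    # One scale per group; fall back to 1.0 for an all-zero group.
    scale = np.where(absmax > 0, absmax / qmax, 1.0)
    codes = np.clip(np.rint(groups / scale), -qmax, qmax).astype(np.int8)
    return codes, scale

def dequantize_groupwise(codes, scale, shape):
    """Map integer codes back to floats using the per-group scales."""
    return (codes.astype(np.float32) * scale).reshape(shape)

# Hypothetical weight tensor: 8 x 256 values, i.e. 32 groups of 64.
rng = np.random.default_rng(0)
w = rng.standard_normal((8, 256)).astype(np.float32)

codes, scale = quantize_groupwise(w, group_size=64, bits=4)
w_hat = dequantize_groupwise(codes, scale, w.shape)
max_err = float(np.abs(w - w_hat).max())
```

Because every group carries its own scale, the rounding error of each value is bounded by half of that group's scale rather than by a scale set by the largest value in the whole tensor. An adaptive type such as MANT goes one step further and also varies the shape of the grid per group, which this sketch does not attempt.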

Paper Structure

This paper contains 53 sections, 8 equations, 15 figures, 5 tables.

Figures (15)

  • Figure 1: LLM accuracy with different quantization granularities. We report the perplexity (PPL) metric (lower is better).
  • Figure 2: Accuracy loss for INT, ANT, and Ideal (clustering algorithm K-Means) adaptive methods in group quantization.
  • Figure 3: The cumulative distribution function (CDF) of the tensor, channel, and group, respectively. The tensor data were taken from layers 8 to 23, while the 16 channel and group data were sampled from one tensor with specific strides.
  • Figure 4: Comparison of group-wise K and V cache quantization. They have different inner dimensions due to the transposition of K (key).
  • Figure 5: Using different $a$ in MANT for data type approximation.
  • ...and 10 more figures