InnerQ: Hardware-aware Tuning-free Quantization of KV Cache for Large Language Models

Sayed Mohammadreza Tayaranian Hosseini, Amir Ardakani, Warren J. Gross

TL;DR

This paper introduces InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy.

Abstract

Reducing the hardware footprint of large language models (LLMs) during decoding is critical for efficient long-sequence generation. A key bottleneck is the key-value (KV) cache, whose size scales with sequence length and easily dominates the memory footprint of the model. Previous work proposed quantization methods that focus on compressing the KV cache while preserving its information. We introduce InnerQ, a hardware-aware KV-cache quantization scheme that lowers decode latency without sacrificing accuracy. InnerQ applies group-wise quantization, grouping the cache matrices over their inner dimension. Unlike previous work that groups over the outer dimension, InnerQ aligns dequantization with the vector-matrix multiplication and enables scale factor reuse across GPU compute units. This reduces memory accesses and accelerates dequantization, yielding up to $22\%$ speedup over previous work and up to $88\%$ over half-precision vector-matrix multiplication. To preserve fidelity under aggressive compression, InnerQ incorporates (i) hybrid quantization, selecting symmetric or asymmetric quantization per group based on local statistics; (ii) high-precision windows for both the most recent tokens and the attention sink tokens to mitigate outlier leakage; and (iii) per-channel normalization of the key cache, computed once during prefill and folded into the query to avoid runtime overhead. Our evaluation experiments on Llama models show that InnerQ maintains few-shot GSM8K performance comparable to non-quantized KV caches and surpasses prior KV cache quantization methods.
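
As a rough illustration of the hybrid quantization in item (i), the sketch below quantizes one group both symmetrically and asymmetrically and keeps the mode with the lower squared reconstruction error. The function names, the 2-bit default, the specific scale and zero-point conventions, and the use of squared error as the "local statistic" are all assumptions made for this example; the abstract does not specify them, and the paper's kernels may differ.

```python
# Hypothetical sketch of per-group hybrid quantization (not the authors' code).
# Convention assumed here: dequantized value = scale * code + zero.

def quantize_symmetric(group, bits=2):
    """Signed codes centred on zero; the zero point is fixed at 0. Assumes bits >= 2."""
    qmin, qmax = -(2 ** (bits - 1)), 2 ** (bits - 1) - 1
    max_abs = max(abs(x) for x in group)
    scale = max_abs / qmax if max_abs > 0 else 1.0
    codes = [min(qmax, max(qmin, round(x / scale))) for x in group]
    return codes, scale, 0.0

def quantize_asymmetric(group, bits=2):
    """Unsigned codes spanning [min, max]; the zero point is the group minimum."""
    levels = 2 ** bits - 1
    lo, hi = min(group), max(group)
    scale = (hi - lo) / levels if hi > lo else 1.0
    codes = [min(levels, max(0, round((x - lo) / scale))) for x in group]
    return codes, scale, lo

def dequantize(codes, scale, zero):
    return [scale * c + zero for c in codes]

def squared_error(group, codes, scale, zero):
    recon = dequantize(codes, scale, zero)
    return sum((x - y) ** 2 for x, y in zip(group, recon))

def quantize_hybrid(group, bits=2):
    """Quantize with both modes and keep the one with the lower error."""
    sym = quantize_symmetric(group, bits)
    asym = quantize_asymmetric(group, bits)
    if squared_error(group, *sym) <= squared_error(group, *asym):
        return (*sym, "symmetric")
    return (*asym, "asymmetric")

if __name__ == "__main__":
    for group in ([0.1, 0.5, 0.9, 1.3], [-1.0, 0.0, 1.0, 0.0]):
        codes, scale, zero, mode = quantize_hybrid(group)
        print(mode, codes, dequantize(codes, scale, zero))
```

In this sketch a group whose values are all positive tends to favour the asymmetric mode, while a group centred on zero tends to favour the symmetric one, which avoids storing a zero point.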

Paper Structure (23 sections, 14 equations, 5 figures, 4 tables, 2 algorithms)

This paper contains 23 sections, 14 equations, 5 figures, 4 tables, 2 algorithms.

Figures (5)

  • Figure 1: Depiction of the quantization process in symmetric quantization (left) and hybrid quantization (right) modes. $\hat{\mathcal{G}}_\mathrm{hybrid}$ is the output of the quantization mode with the lower error and $\mathcal{M}$ is the quantized matrix from the mode with the lower error. The operations enclosed by the dashed gray line add no overhead because they work on data that is already loaded during a memory-bound operation.
  • Figure 2: Visualization of the vector-vector multiplication between the floating-point vector $Q$ and one row of the quantized matrix $\hat{K}_\mathrm{cache}$ in an illustrative example. Cells with the same color are in the same quantization group and share a scale factor and zero point. The same visualization applies to the vector $P$ and the quantized matrix $\hat{V}_\mathrm{cache}$.
  • Figure 3: (a) Speedup of vector-matrix multiplication when the matrix is quantized to 2 bits and grouped over the inner dimension versus when the matrix is in half-precision and multiplied using torch.matmul. (b) Speedup of the vector-matrix multiplication with the 2-bit quantized matrix grouped over the inner dimension versus the outer dimension.
  • Figure 4: Latency of the quantization operation when using hybrid quantization versus symmetric quantization.
  • Figure 5: Effect of the high-precision window length on the GSM8K evaluation performance of Llama with a quantized KV cache.
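
The scale-factor sharing described in the Figure 2 caption can be sketched in a few lines. When the elements of a row that share a scale factor and zero point are contiguous along the inner (reduction) dimension, the dot product can accumulate the raw codes per group and apply each scale and zero point once per group instead of once per element. The function names, the `scale * code + zero` dequantization convention, and the sample values are assumptions for illustration only; whether the paper's GPU kernel factors the computation exactly this way is not stated in this excerpt.

```python
# Hypothetical sketch: dot product of a float vector with one quantized row
# whose quantization groups run along the inner (reduction) dimension.
# Convention assumed here: dequantized value = scale * code + zero.

def grouped_dot(q, codes, scales, zeros, group_size):
    """Apply each group's scale and zero point once, after accumulating codes."""
    total = 0.0
    for g, (scale, zero) in enumerate(zip(scales, zeros)):
        start = g * group_size
        q_g = q[start:start + group_size]
        c_g = codes[start:start + group_size]
        code_sum = sum(qi * ci for qi, ci in zip(q_g, c_g))  # sum of q_i * code_i
        q_sum = sum(q_g)                                     # sum of q_i
        total += scale * code_sum + zero * q_sum
    return total

def reference_dot(q, codes, scales, zeros, group_size):
    """Dequantize every element first, then take an ordinary dot product."""
    row = [scales[i // group_size] * c + zeros[i // group_size]
           for i, c in enumerate(codes)]
    return sum(qi * ki for qi, ki in zip(q, row))

if __name__ == "__main__":
    q = [0.5, -1.0, 2.0, 0.25]      # floating-point query vector
    codes = [0, 3, 1, 2]            # 2-bit codes for one cached row
    scales = [0.4, 0.1]             # one scale per group of 2 elements
    zeros = [0.1, -0.2]             # one zero point per group
    print(grouped_dot(q, codes, scales, zeros, group_size=2))
    print(reference_dot(q, codes, scales, zeros, group_size=2))
```

Both functions compute the same value up to floating-point rounding; the grouped form performs one multiplication by the scale per group rather than one per element, which is the kind of reuse that inner-dimension grouping makes possible.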