
QCQA: Quality and Capacity-aware grouped Query Attention

Vinay Joshi, Prashant Laddha, Shambhavi Sinha, Om Ji Omer, Sreenivas Subramoney

TL;DR

Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a computationally efficient and inexpensive fitness function, achieves a significantly better tradeoff between KV-cache capacity and LLM accuracy compared to GQA.

Abstract

Excessive memory requirements of key and value features (KV-cache) present significant challenges in the autoregressive inference of large language models (LLMs), restricting both the speed and length of text generation. Approaches such as Multi-Query Attention (MQA) and Grouped Query Attention (GQA) mitigate these challenges by grouping query heads and consequently reducing the number of corresponding key and value heads. However, MQA and GQA decrease the KV-cache size requirements at the expense of LLM accuracy (quality of text generation). These methods do not ensure an optimal tradeoff between KV-cache size and text generation quality due to the absence of quality-aware grouping of query heads. To address this issue, we propose Quality and Capacity-Aware Grouped Query Attention (QCQA), which identifies optimal query head groupings using an evolutionary algorithm with a computationally efficient and inexpensive fitness function. We demonstrate that QCQA achieves a significantly better tradeoff between KV-cache capacity and LLM accuracy compared to GQA. For the Llama2 7B model, QCQA achieves 20% higher accuracy than GQA with similar KV-cache size requirements in the absence of fine-tuning. After fine-tuning both QCQA and GQA, for a similar KV-cache size, QCQA provides 10.55% higher accuracy than GQA. Furthermore, QCQA requires 40% less KV-cache size than GQA to attain similar accuracy. The proposed quality and capacity-aware grouping of query heads can serve as a new paradigm for KV-cache optimization in autoregressive LLM inference.
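
To make the capacity side of the tradeoff concrete, the following back-of-the-envelope sketch estimates KV-cache size as a function of the number of key/value heads. It is an illustration, not part of the paper: the function name `kv_cache_bytes` is hypothetical, and the example assumes the publicly documented Llama2 7B shape (32 layers, 32 attention heads, head dimension 128), 16-bit cache entries, and a 4096-token sequence at batch size 1.

```python
def kv_cache_bytes(num_layers, num_kv_heads, head_dim, seq_len,
                   batch_size=1, bytes_per_elem=2):
    """Estimate KV-cache size in bytes (hypothetical helper, not from the paper).

    The leading factor 2 accounts for storing one key tensor and one value
    tensor per layer. bytes_per_elem=2 assumes fp16/bf16 cache entries.
    """
    return (2 * num_layers * num_kv_heads * head_dim
            * seq_len * batch_size * bytes_per_elem)


# Assumed Llama2 7B shape: 32 layers, 32 heads, head_dim 128, 4096 tokens.
mha_bytes = kv_cache_bytes(32, num_kv_heads=32, head_dim=128, seq_len=4096)
gqa_bytes = kv_cache_bytes(32, num_kv_heads=16, head_dim=128, seq_len=4096)
mqa_bytes = kv_cache_bytes(32, num_kv_heads=1, head_dim=128, seq_len=4096)

GIB = 1024 ** 3
print(f"MHA (32 KV heads): {mha_bytes / GIB:.3f} GiB")
print(f"16 KV heads:       {gqa_bytes / GIB:.3f} GiB")
print(f"MQA (1 KV head):   {mqa_bytes / GIB:.3f} GiB")
```

Under these assumptions the cache scales linearly with the number of key/value heads, so halving the heads halves the cache. Which query heads share a key/value head does not change the size, only the accuracy, which is the degree of freedom QCQA optimizes.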

Paper Structure

This paper contains 20 sections, 4 equations, 8 figures, 1 table, and 2 algorithms.

Figures (8)

  • Figure 1: The average accuracy of GQA and QCQA at 50% of the original KV-cache size. For QCQA, group cardinality is either equal (QCQA-EC) or arbitrary (QCQA-AC). QCQA-AC outperforms QCQA-EC and GQA, and without fine-tuning, QCQA-AC matches the accuracy of GQA fine-tuned for 3 epochs.
  • Figure 2: Illustration of grouping approaches employed in grouped-query attention (GQA), multi-query attention (MQA), and QCQA. Solid rectangles indicate heads, outlined rectangles indicate a head that was merged, and solid rectangles with a black outline indicate a mean-pooled head. The arrows indicate the merging of heads into one. MQA is trained from scratch with a single key (and value) head. GQA forms groups of consecutive query heads of an MHA checkpoint and mean-pools the corresponding key (and value) heads. Starting from an MHA checkpoint, QCQA can form groups of unequal cardinality by using an evolutionary algorithm with inexpensive fitness functions, and then mean-pools the corresponding key (and value) heads.
  • Figure 3: LLM task accuracy at different KV-cache sizes with multiple grouping approaches. Both QCQA-AC and QCQA-EC consistently outperform GQA.
  • Figure 4: Relation between WSE, KV-cache, and LLM evaluation accuracy.
  • Figure 5: Illustration of the QCQA grouping approach for enabling arbitrary cardinality. Each candidate $X$ is a vector of $H$ elements, one per head: position $i$ corresponds to head $i$, and the value stored at that position is the group to which the head belongs.
  • ...and 3 more figures
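
The candidate encoding in the Figure 5 caption and the mean-pooling in the Figure 2 caption can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the helper names (`groups_from_candidate`, `mean_pool_heads`, `kv_cache_fraction`), the toy 8-head example, and the specific group assignments are all hypothetical, and each head is represented as a flat list of numbers standing in for its key (or value) projection weights. The evolutionary search and its fitness function are not shown, since this section does not define them.

```python
from collections import defaultdict


def groups_from_candidate(candidate):
    """Map each group id to the list of head indices assigned to it.

    candidate[i] is the group id of head i (the encoding in Figure 5).
    """
    groups = defaultdict(list)
    for head_idx, group_id in enumerate(candidate):
        groups[group_id].append(head_idx)
    return dict(groups)


def mean_pool_heads(heads, candidate):
    """Mean-pool per-head vectors within each group.

    heads: list of H equal-length lists (toy stand-ins for key/value weights).
    Returns a dict: group id -> element-wise mean over the group's heads.
    """
    pooled = {}
    for group_id, members in groups_from_candidate(candidate).items():
        dim = len(heads[members[0]])
        pooled[group_id] = [
            sum(heads[h][d] for h in members) / len(members)
            for d in range(dim)
        ]
    return pooled


def kv_cache_fraction(candidate):
    """Fraction of the original KV heads that remain after grouping."""
    return len(set(candidate)) / len(candidate)


# Toy example with H = 8 heads, each a 2-element vector.
heads = [[float(i), float(2 * i)] for i in range(8)]

# GQA-style: consecutive heads, equal cardinality (4 groups of 2).
gqa_candidate = [0, 0, 1, 1, 2, 2, 3, 3]
# Hypothetical arbitrary-cardinality grouping (also 4 groups).
ac_candidate = [0, 1, 0, 0, 2, 3, 0, 0]

gqa_pooled = mean_pool_heads(heads, gqa_candidate)
ac_pooled = mean_pool_heads(heads, ac_candidate)
```

Both candidates keep 4 of 8 key/value heads, so they have the same KV-cache footprint, but they pool different sets of heads. A quality-aware search would compare such candidates by a fitness score rather than fixing the consecutive, equal-sized grouping in advance.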