Optimised Grouped-Query Attention Mechanism for Transformers

Yuang Chen, Cheng Zhang, Xitong Gao, Robert D. Mullins, George A. Constantinides, Yiren Zhao

TL;DR

This work proposes AsymGQA, an activation-informed approach that asymmetrically groups the heads of an MHA into a GQA for better model performance, addressing GQA's trade-off between model performance and hardware efficiency.

Abstract

Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA into a GQA, neighbouring queries in the MHA are evenly split into groups, and each group shares its key and value layers. In this work, we propose AsymGQA, an activation-informed approach that asymmetrically groups an MHA into a GQA for better model performance. AsymGQA outperforms GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses GQA's trade-off between model performance and hardware efficiency.
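
The neighbour-grouping baseline described in the abstract can be illustrated with a minimal sketch. It assumes that the key (or value) heads in a group are merged by averaging their weights; the abstract does not state the merge rule, so that choice, the helper names `neighbour_groups` and `merge_heads`, and the toy weights are all illustrative assumptions rather than the authors' implementation.

```python
def neighbour_groups(num_heads, group_size):
    """Split head indices 0..num_heads-1 into consecutive, equally sized groups."""
    if num_heads % group_size != 0:
        raise ValueError("group_size must divide num_heads")
    return [list(range(start, start + group_size))
            for start in range(0, num_heads, group_size)]


def merge_heads(head_weights, groups):
    """Average the per-head weights within each group (assumed merge rule).

    head_weights: one flat weight vector per key (or value) head.
    Returns one shared weight vector per group.
    """
    merged = []
    for group in groups:
        members = [head_weights[h] for h in group]
        # zip(*members) walks the weight vectors element by element.
        merged.append([sum(column) / len(members) for column in zip(*members)])
    return merged


# Toy example: 4 key heads with 2 weights each, grouped in pairs.
key_heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
groups = neighbour_groups(num_heads=4, group_size=2)
shared_keys = merge_heads(key_heads, groups)
```

Here heads 0 and 1 share one key layer and heads 2 and 3 share another, purely because they are adjacent. An asymmetric scheme would instead pass `merge_heads` groups of unequal size chosen by some similarity criterion.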

Paper Structure

This paper contains 21 sections, 2 equations, 5 figures, 4 tables, and 2 algorithms.

Figures (5)

  • Figure 1: Comparison of GQA and AsymGQA. AsymGQA leverages activation-induced layer similarity to determine the attention head grouping for better model performance.
  • Figure 2: Naive neighbour grouping vs AsymGQA.
  • Figure 3: Number of parameters and floating-point operations (FLOPs) vs group size of the attention layer. The hardware-efficiency return diminishes as the group size increases.
  • Figure 4: Neighbour grouping vs activation-informed symmetric grouping. Grouping by activation-induced similarity between key (value) layers improves model performance even without varied group sizes.
  • Figure 5: Symmetric vs asymmetric grouping. Asymmetric grouping further improves model performance by allowing varied group sizes.
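
The diminishing return noted in the Figure 3 caption can be seen from a simple parameter count. The sketch below assumes a standard attention layer with no biases, where the query and output projections are each d_model x d_model and the key and value projections shrink by the group size. The function name `attention_params` and the hidden size of 4096 (that of LLaMA-2-7B) are choices made for this illustration; the numbers are not taken from the paper's figure.

```python
def attention_params(d_model, group_size):
    """Parameter count of one attention layer under uniform grouping.

    Assumes no biases and that group_size divides the number of heads,
    so the key/value width is d_model // group_size.
    """
    q_and_o = 2 * d_model * d_model
    k_and_v = 2 * d_model * (d_model // group_size)
    return q_and_o + k_and_v


d_model = 4096  # assumed hidden size for illustration
sizes = (1, 2, 4, 8, 16, 32)
counts = [attention_params(d_model, g) for g in sizes]

# Parameters saved by each successive doubling of the group size.
savings = [before - after for before, after in zip(counts, counts[1:])]

for g, n in zip(sizes, counts):
    print(f"group size {g:2d}: {n:,} parameters")
```

Under these assumptions the count is 2·d_model² + 2·d_model²/g, so each doubling of the group size g saves half as much as the previous one, and the total can never drop below the 2·d_model² contributed by the query and output projections.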