Optimised Grouped-Query Attention Mechanism for Transformers

Yuang Chen; Cheng Zhang; Xitong Gao; Robert D. Mullins; George A. Constantinides; Yiren Zhao

TL;DR

This work proposes AsymGQA, an activation-informed approach that asymmetrically groups the heads of an MHA into a GQA for better model performance, addressing GQA's trade-off between model performance and hardware efficiency.

Abstract

Grouped-query attention (GQA) has been widely adopted in LLMs to mitigate the complexity of multi-head attention (MHA). To transform an MHA into a GQA, neighbouring queries in the MHA are evenly split into groups, and each group shares the key and value layers. In this work, we propose AsymGQA, an activation-informed approach that asymmetrically groups an MHA into a GQA for better model performance. AsymGQA outperforms GQA within the same model size budget. For example, AsymGQA LLaMA-2-7B has an accuracy increase of 7.5% on MMLU compared to neighbour grouping. Our approach addresses GQA's trade-off between model performance and hardware efficiency.
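The neighbour grouping baseline described above can be illustrated with a minimal sketch. The function names `neighbour_groups` and `mean_pool` are hypothetical, and merging the key (or value) heads of a group by averaging their weights is an assumption made for illustration; the abstract only states that each group shares the key and value layers.

```python
from typing import List


def neighbour_groups(num_heads: int, group_size: int) -> List[List[int]]:
    """Split head indices 0..num_heads-1 into consecutive, equally sized groups."""
    if num_heads % group_size != 0:
        raise ValueError("group_size must divide num_heads for an even split")
    return [
        list(range(start, start + group_size))
        for start in range(0, num_heads, group_size)
    ]


def mean_pool(head_weights: List[List[float]], group: List[int]) -> List[float]:
    """Average the flattened weights of the heads in one group.

    Mean-pooling is an assumption of this sketch, not a detail from the abstract.
    """
    dim = len(head_weights[0])
    return [sum(head_weights[h][i] for h in group) / len(group) for i in range(dim)]


# Toy example: 4 key heads, each with 2 flattened weights, grouped in pairs.
key_heads = [[1.0, 2.0], [3.0, 4.0], [5.0, 6.0], [7.0, 8.0]]
groups = neighbour_groups(num_heads=4, group_size=2)
shared_keys = [mean_pool(key_heads, g) for g in groups]
```

Here heads 0 and 1 form one group and heads 2 and 3 another, purely because of their position. An asymmetric scheme would instead allow groups of different sizes chosen from a similarity signal.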

Paper Structure (21 sections, 2 equations, 5 figures, 4 tables, 2 algorithms)

Introduction
Method
Grouping Strategies
Neighbour grouping
Activation-informed symmetric grouping
Activation-informed asymmetric grouping (AsymGQA)
Activation-Informed Head Similarity
Evaluation
Experiment Setup
Models and datasets
Grouping
Fine-tuning
Remarkable Performance Gain of AsymGQA
Ablation Study
SG vs NG
...and 6 more sections

Figures (5)

Figure 1: Comparison of GQA and AsymGQA. AsymGQA leverages activation-induced layer similarity to determine the attention head grouping for better model performance.
Figure 2: Naive neighbour grouping vs AsymGQA.
Figure 3: #Parameters and floating-point operations (FLOPs) vs group size of attention layer. We see a diminishing hardware efficiency return as the group size increases.
Figure 4: Neighbour grouping vs activation-informed symmetric grouping. Activation-induced similarity between key (value) layers improves model performance even without varied group sizes.
Figure 5: Symmetric vs Asymmetric grouping. Asymmetric further improves model performance by allowing varied group sizes.
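The diminishing return noted in the Figure 3 caption can be reproduced with simple arithmetic. The sketch below counts the projection parameters of one attention layer under a uniform group size, assuming bias-free query, key, value, and output projections. The function `attention_params` is hypothetical, and the dimensions (model width 4096, 32 heads, head dimension 128) are chosen to match a LLaMA-2-7B-sized layer as an illustrative assumption.

```python
def attention_params(d_model: int, num_heads: int, head_dim: int, group_size: int) -> int:
    """Parameter count of one attention layer with uniform group size.

    Assumes bias-free Q, K, V and output projections (an assumption of this sketch).
    """
    if num_heads % group_size != 0:
        raise ValueError("group_size must divide num_heads")
    kv_heads = num_heads // group_size
    # Query and output projections are unaffected by grouping.
    q_and_out = 2 * d_model * num_heads * head_dim
    # Key and value projections shrink by a factor of group_size.
    k_and_v = 2 * d_model * kv_heads * head_dim
    return q_and_out + k_and_v


group_sizes = [1, 2, 4, 8, 16, 32]
counts = [attention_params(4096, 32, 128, g) for g in group_sizes]
# Parameters saved by each successive doubling of the group size.
savings = [prev - cur for prev, cur in zip(counts, counts[1:])]
```

Under these assumptions the key and value parameters scale as 1/g for group size g, so each doubling of the group size saves half as many parameters as the previous one.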
