GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values

Farnoosh Javadi; Walid Ahmed; Habib Hajimolahoseini; Foozhan Ataiefard; Mohammad Hassanpour; Saina Asani; Austin Wen; Omar Mohamed Awad; Kangling Liu; Yang Liu

TL;DR

GQKVA targets slow pre-training and over-parameterization in transformer models with a generalized attention scheme that groups queries and key-value pairs. It partitions Q into $g_q$ groups and KV into $g_{kv}$ groups, with $h = g_q g_{kv}$, and computes attention over every combination of a query group and a key-value group without duplicating head outputs. MHA, MQA, GQA, MKVA, and GKVA are unified as special cases. Empirical results on ViT-small show that GKVA variants can achieve higher accuracy with fewer parameters, while GQKVA variants can match or exceed MQA performance at a reduced size, illustrating a linear trade-off between model size, training speed (TPS), and accuracy. The findings indicate that MHA is not always the best choice and that pre-training can be accelerated and memory saved without substantial accuracy loss, with potential applicability to larger transformers in future work. Overall, GQKVA offers a practical, configurable route to pre-training efficiency and model compression across transformer architectures.
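
The grouping described above can be sketched in a few lines of NumPy. This is a minimal illustration, not the authors' implementation: the function name `gqkva_attention`, the weight layout, and the head ordering are assumptions made for the example. It projects the input into $g_q$ query groups and $g_{kv}$ key-value groups, then runs scaled dot-product attention on every (query group, key-value group) pair, giving $h = g_q g_{kv}$ distinct head outputs.

```python
import numpy as np


def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)


def gqkva_attention(x, w_q, w_k, w_v):
    """Hypothetical sketch of attention over all (Q group, KV group) pairs.

    x:   (n, d_model) token embeddings
    w_q: (g_q, d_model, d_head) one query projection per query group
    w_k: (g_kv, d_model, d_head) one key projection per key-value group
    w_v: (g_kv, d_model, d_head) one value projection per key-value group
    Returns (n, g_q * g_kv * d_head): the concatenated head outputs.
    """
    q = np.einsum("nd,gdk->gnk", x, w_q)  # (g_q, n, d_head)
    k = np.einsum("nd,gdk->gnk", x, w_k)  # (g_kv, n, d_head)
    v = np.einsum("nd,gdk->gnk", x, w_v)  # (g_kv, n, d_head)
    d_head = q.shape[-1]

    # Scores for every pair (i, j) of query group i and key-value group j.
    scores = np.einsum("ink,jmk->ijnm", q, k) / np.sqrt(d_head)
    weights = softmax(scores, axis=-1)               # (g_q, g_kv, n, n)
    out = np.einsum("ijnm,jmk->ijnk", weights, v)    # (g_q, g_kv, n, d_head)

    g_q, g_kv, n, _ = out.shape
    # Head (i, j) lands at index i * g_kv + j along the concatenated axis.
    heads = out.reshape(g_q * g_kv, n, d_head)
    return heads.transpose(1, 0, 2).reshape(n, -1)


rng = np.random.default_rng(0)
x = rng.normal(size=(5, 16))
w_q = rng.normal(size=(2, 16, 4))   # g_q = 2
w_k = rng.normal(size=(3, 16, 4))   # g_kv = 3
w_v = rng.normal(size=(3, 16, 4))
out = gqkva_attention(x, w_q, w_k, w_v)  # h = 2 * 3 = 6 heads
```

Under this sketch, setting `g_kv = 1` shares one key-value projection across all query groups, and setting `g_q = 1` shares one query projection across all key-value groups. An output projection, biases, and batching are omitted for brevity.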

Abstract

Massive transformer-based models face several challenges, including slow and computationally intensive pre-training and over-parametrization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT, which achieved an approximate 0.3% increase in accuracy while reducing the model size by about 4% in the task of image classification. Additionally, our most aggressive model reduction experiment resulted in a reduction of approximately 15% in model size, with only around a 1% drop in accuracy.
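
To make the size trade-off concrete, the following sketch counts only the query, key, and value projection weights of a single attention layer for each way of factoring $h$ into $g_q \cdot g_{kv}$. The helper names and the dimensions (`d_model = 384`, `d_head = 64`, `h = 6`) are illustrative assumptions, not values taken from the paper, and the count ignores biases, the output projection, and every other part of the model, so it does not reproduce the whole-model percentages reported above.

```python
def qkv_projection_params(d_model, d_head, g_q, g_kv):
    # One (d_model x d_head) matrix per query group, plus one key and
    # one value matrix per key-value group. Biases are ignored.
    return d_model * d_head * (g_q + 2 * g_kv)


def factor_pairs(h):
    # All (g_q, g_kv) pairs with g_q * g_kv == h.
    return [(g_q, h // g_q) for g_q in range(1, h + 1) if h % g_q == 0]


# Assumed, illustrative dimensions.
d_model, d_head, h = 384, 64, 6

# Baseline: h separate query, key, and value projections.
baseline = 3 * h * d_model * d_head

for g_q, g_kv in factor_pairs(h):
    params = qkv_projection_params(d_model, d_head, g_q, g_kv)
    print(f"g_q={g_q}, g_kv={g_kv}: {params} params "
          f"({params / baseline:.0%} of the ungrouped baseline)")
```

Each factorization uses fewer projection weights than the ungrouped baseline, which is one way to see why different grouping choices land at different points on the size-accuracy curve.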

Paper Structure (6 sections, 1 equation, 2 figures, 1 table)

Introduction
Method
Preliminaries
Proposed Methods
Experiments
Conclusion

Figures (2)

Figure 1: Illustration of various strategies for grouping queries, keys, and values within the attention mechanism, including Vanilla MHA, MQA, GQA, MKVA, GKVA, and GQKVA.
Figure 2: Both panels show that there are attention mechanisms faster and lighter than MHA. They also show that performance correlates linearly with model size and TPS.
