GQKVA: Efficient Pre-training of Transformers by Grouping Queries, Keys, and Values
Farnoosh Javadi, Walid Ahmed, Habib Hajimolahoseini, Foozhan Ataiefard, Mohammad Hassanpour, Saina Asani, Austin Wen, Omar Mohamed Awad, Kangling Liu, Yang Liu
TL;DR
GQKVA targets two costs of transformer models, slow pre-training and over-parameterization, with a generalized attention scheme that groups queries and keys/values. Queries are partitioned into $g_q$ groups and key-value pairs into $g_{kv}$ groups, with $h = g_q g_{kv}$; attention is computed over all combinations of query and key-value groups without duplicating head outputs, and MHA, MQA, GQA, MKVA, and GKVA arise as special cases. Empirical results on ViT-small show that GKVA variants can achieve higher accuracy with smaller parameter counts, while GQKVA variants can match or exceed MQA performance at a reduced size, illustrating a linear trade-off between model size, training speed (TPS), and accuracy. The findings show that MHA is not always the best choice and that substantial pre-training acceleration and memory savings are achievable with little accuracy loss, with potential applicability to larger transformers in future work. Overall, GQKVA offers a practical, configurable route to pre-training efficiency and model compression across transformer architectures.
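The grouping described above can be illustrated with a small counting sketch. It is not the authors' implementation: the function names `gqkva_heads` and `qkv_projection_params`, the sizes `d_model = 384` and `h = 6`, and the assumption that each query group and each key-value group owns one projection of width `head_dim` (biases ignored) are all hypothetical choices made for illustration.

```python
from itertools import product


def gqkva_heads(g_q, g_kv):
    """Return one (query_group, kv_group) index pair per attention head.

    Every query group is paired with every key-value group, so the
    number of heads is g_q * g_kv.
    """
    return list(product(range(g_q), range(g_kv)))


def qkv_projection_params(d_model, head_dim, g_q, g_kv):
    """Count Q, K, V projection weights under the assumption stated above.

    Assumption (not from the source): each query group has one
    d_model x head_dim projection, and each key-value group has one such
    projection for K and one for V. Biases are ignored.
    """
    q_params = d_model * head_dim * g_q
    kv_params = 2 * d_model * head_dim * g_kv
    return q_params + kv_params


# Hypothetical sizes chosen for illustration only.
d_model, h = 384, 6
head_dim = d_model // h

# Every factorization h = g_q * g_kv gives a valid configuration.
for g_q, g_kv in [(6, 1), (3, 2), (2, 3), (1, 6)]:
    heads = gqkva_heads(g_q, g_kv)
    assert len(heads) == h
    n_params = qkv_projection_params(d_model, head_dim, g_q, g_kv)
    print(f"g_q={g_q}, g_kv={g_kv}: {len(heads)} heads, {n_params} QKV weights")
```

Under these assumptions, the setting $g_q = h$, $g_{kv} = 1$ shares a single key-value projection across all heads, which matches the usual description of MQA, and the other factorizations of $h$ trade query parameters against key-value parameters.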
Abstract
Massive transformer-based models face several challenges, including slow, computationally intensive pre-training and over-parameterization. This paper addresses these challenges by proposing a versatile method called GQKVA, which generalizes query, key, and value grouping techniques. GQKVA is designed to speed up transformer pre-training while reducing the model size. Our experiments with various GQKVA variants highlight a clear trade-off between performance and model size, allowing for customized choices based on resource and time limitations. Our findings also indicate that the conventional multi-head attention approach is not always the best choice, as there are lighter and faster alternatives available. We tested our method on ViT for image classification, where it achieved an approximately 0.3% increase in accuracy while reducing the model size by about 4%. Additionally, our most aggressive model reduction experiment reduced the model size by approximately 15%, with only around a 1% drop in accuracy.
