Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers
Marko Karbevski, Antonij Mijoski
TL;DR
This work investigates whether the Query, Key, Value (QKV) weight triplet in decoder-only transformers is essential. Using the Reparametrization Lemma, the authors show that the Query projection can be absorbed into a basis change, so that $W_Q$ can be set to the identity under specific architectural conditions, and they extend the idea to weight-shared blocks and skip connections. They prove theorems for exact elimination in simplified settings and validate the theory empirically by pretraining GPT-style models from scratch with $W_Q = I_d$. After careful hyperparameter tuning, these models reach comparable validation loss while using 25% fewer attention parameters per layer (8.3% of transformer block parameters). The results suggest that transformers may be overparameterized and that similar architectural simplifications could extend to broader settings, motivating study at larger scales and with more diverse architectures. Overall, the paper contributes theoretical foundations, supporting experiments, and a path toward more parameter-efficient decoder-only transformer designs.
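The 25% and 8.3% figures can be reproduced with simple counting. The sketch below is only an illustration under stated assumptions: a standard block with four square attention matrices ($W_Q$, $W_K$, $W_V$, $W_O$), an MLP with expansion ratio 4, and biases and layer-normalization parameters ignored. The function names and the width of 768 are chosen for the example and are not taken from the paper.

```python
def block_param_counts(d_model, mlp_ratio=4):
    """Count weight-matrix parameters in one decoder block.

    Assumption: biases and LayerNorm parameters are ignored.
    """
    attention = 4 * d_model * d_model        # W_Q, W_K, W_V, W_O
    mlp = 2 * mlp_ratio * d_model * d_model  # up- and down-projection
    return attention, mlp


def query_savings(d_model, mlp_ratio=4):
    """Fraction of attention and of block parameters saved by fixing W_Q = I."""
    attention, mlp = block_param_counts(d_model, mlp_ratio)
    removed = d_model * d_model              # W_Q is no longer a trained matrix
    return removed / attention, removed / (attention + mlp)


# 768 is an illustrative width; both ratios are independent of d_model.
attn_frac, block_frac = query_savings(d_model=768)
print(f"attention: {attn_frac:.1%}, block: {block_frac:.1%}")
```

Under these assumptions the saving is one of four attention matrices, and one part in twelve of the block, which matches the reported numbers.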
Abstract
The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
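The redundancy claim can be made concrete in the simplest possible setting. The sketch below is not the paper's Reparametrization Lemma, which is not stated in this excerpt. It only checks a related elementary identity under strong assumptions: a single attention head with square $d \times d$ weights, no layer normalization, and no skip connection. In that case the attention logits depend on $W_Q$ and $W_K$ only through the product $W_Q W_K^\top$, so setting $W_Q = I_d$ and using merged key weights $W_K W_Q^\top$ gives the same attention weights. All names in the code are hypothetical.

```python
import math
import random


def matmul(a, b):
    """Plain-Python matrix product of two lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(a):
    return [list(col) for col in zip(*a)]


def random_matrix(rows, cols, rng):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def causal_attention_weights(queries, keys):
    """Row-wise softmax of Q K^T / sqrt(d) with a causal mask."""
    d = len(queries[0])
    logits = matmul(queries, transpose(keys))
    weights = []
    for i, row in enumerate(logits):
        visible = [v / math.sqrt(d) for v in row[: i + 1]]  # positions <= i
        peak = max(visible)                                  # for numerical stability
        exps = [math.exp(v - peak) for v in visible]
        total = sum(exps)
        weights.append([e / total for e in exps] + [0.0] * (len(row) - i - 1))
    return weights


rng = random.Random(0)
seq_len, d_model = 5, 4
x = random_matrix(seq_len, d_model, rng)
w_q = random_matrix(d_model, d_model, rng)
w_k = random_matrix(d_model, d_model, rng)

# Standard parametrization: Q = X W_Q, K = X W_K.
standard = causal_attention_weights(matmul(x, w_q), matmul(x, w_k))

# Reduced parametrization: W_Q = I, so Q = X, and the key weights absorb W_Q.
w_k_merged = matmul(w_k, transpose(w_q))
reduced = causal_attention_weights(x, matmul(x, w_k_merged))

max_gap = max(abs(a - b) for ra, rb in zip(standard, reduced) for a, b in zip(ra, rb))
print(f"max difference in attention weights: {max_gap:.2e}")
```

The two parametrizations should agree up to floating-point rounding. The identity does not carry over directly to full architectures with multiple heads, layer normalization, and skip connections, which is why the paper's theorems and from-scratch training experiments are needed.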
