Key and Value Weights Are Probably All You Need: On the Necessity of the Query, Key, Value weight Triplet in Decoder-Only Transformers

Marko Karbevski, Antonij Mijoski

TL;DR

This work investigates whether the Query, Key, Value (QKV) weight triplet in decoder-only transformers is essential. By leveraging the Reparametrization Lemma, the authors show that the Query projection can be absorbed into a basis change, allowing $W_Q$ to be set to the identity under specific architectural conditions, and extend these ideas to weight-shared blocks and skip connections. They provide rigorous theorems for exact elimination in simplified settings and validate the theory empirically by pretraining GPT-style models from scratch with $W_Q=I_d$, achieving comparable validation loss while reducing attention parameters by 25% per layer (8.3% of transformer block parameters), after careful hyperparameter tuning. The results suggest that transformers may be overparameterized and that similar architectural simplifications could extend to broader settings, motivating study at larger scales and with more diverse architectures. Collectively, the paper contributes theoretical foundations, pragmatic experiments, and a pathway toward more parameter-efficient decoder-only transformer designs.

Abstract

The Query, Key, Value weight triplet is a building block of current attention mechanisms in state-of-the-art LLMs. We theoretically investigate whether this triplet can be reduced, proving under simplifying assumptions that the Query weights are redundant, thereby reducing the number of non-embedding/lm-head parameters by over 8%. We validate the theory on full-complexity GPT-3 small architectures (with layer normalization, skip connections, and weight decay) trained from scratch, demonstrating that the reduced model achieves comparable validation loss to standard baselines. These findings motivate the investigation of the Query weight redundancy at scale.
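The 25% and roughly 8.3% figures quoted above can be reproduced with a short back-of-the-envelope count. The sketch below is not taken from the paper: it assumes a standard GPT-style block with four $d \times d$ attention matrices ($W_Q$, $W_K$, $W_V$ and an output projection) and a two-layer MLP with a 4x hidden expansion, and it ignores biases and layer-norm parameters. The helper name `block_param_counts` and the choice $d = 768$ are illustrative assumptions.

```python
def block_param_counts(d, mlp_ratio=4):
    """Approximate weight counts for one transformer block.

    Assumption (not from the source): attention holds four d x d matrices
    (W_Q, W_K, W_V, W_O) and the MLP holds two matrices of shape
    d x (mlp_ratio * d) and (mlp_ratio * d) x d. Biases and layer-norm
    parameters are ignored.
    """
    attention = 4 * d * d
    mlp = 2 * mlp_ratio * d * d
    return attention, mlp


d = 768  # illustrative hidden size, an assumption
attention, mlp = block_param_counts(d)

saved = d * d  # parameters removed by fixing W_Q = I_d
attn_fraction = saved / attention
block_fraction = saved / (attention + mlp)

print(f"attention parameters saved: {attn_fraction:.1%}")
print(f"block parameters saved:     {block_fraction:.1%}")
```

Under these assumptions the ratios are 1/4 and 1/12 regardless of $d$, which matches the reported 25% and 8.3%.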

Paper Structure

This paper contains 39 sections, 10 theorems, 69 equations, 4 figures, 2 tables.

Key Result

Lemma 3.1

Let $n, d \in \mathbb{N}$ and let $\Omega$ be any set. Consider a function $f$ that depends on its first argument $X \in \text{Mat}(n, d)$ only through $XW_Q$, $XW_K$, $XW_V$. That is, suppose there exists a function $g: \text{Mat}(n, d)^3 \to \Omega$ such that $f(X, W_Q, W_K, W_V) = g(XW_Q, XW_K, XW_V)$. Then for any invertible $W_Q \in GL(d)$ and any $W_K, W_V \in \text{Mat}(d,d)$, there exist matrices $\Theta, \widetilde{W}_K, \widetilde{W}_V \in \text{Mat}(d,d)$ s
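A small numerical check may help make the reparametrization idea concrete. The lemma's conclusion is truncated in this excerpt, so the construction below is an assumption rather than the authors' stated one: it takes $\Theta = W_Q$, $\widetilde{W}_K = W_Q^{-1} W_K$ and $\widetilde{W}_V = W_Q^{-1} W_V$, and uses single-head causal softmax attention as one possible choice of $g$. With $\widetilde{X} = X\Theta$, the products $\widetilde{X} I_d$, $\widetilde{X}\widetilde{W}_K$ and $\widetilde{X}\widetilde{W}_V$ equal $XW_Q$, $XW_K$ and $XW_V$, so any $g$ of those three products is unchanged.

```python
import numpy as np


def softmax(z, axis=-1):
    # Subtract the row maximum for numerical stability.
    z = z - z.max(axis=axis, keepdims=True)
    e = np.exp(z)
    return e / e.sum(axis=axis, keepdims=True)


def attention(q, k, v):
    """Single-head causal attention, used here as an example of g(Q, K, V)."""
    n, d = q.shape
    scores = q @ k.T / np.sqrt(d)
    # Mask out future positions (strictly above the diagonal).
    future = np.triu(np.ones((n, n), dtype=bool), k=1)
    scores = np.where(future, -np.inf, scores)
    return softmax(scores) @ v


rng = np.random.default_rng(0)
n, d = 6, 8
X = rng.standard_normal((n, d))
# A well-conditioned, invertible W_Q (illustrative choice, an assumption).
W_Q = np.eye(d) + 0.1 * rng.standard_normal((d, d))
W_K = rng.standard_normal((d, d))
W_V = rng.standard_normal((d, d))

# Assumed reparametrization: absorb W_Q into a basis change of the input.
theta = W_Q
W_Q_inv = np.linalg.inv(W_Q)
W_K_tilde = W_Q_inv @ W_K
W_V_tilde = W_Q_inv @ W_V

X_tilde = X @ theta
original = attention(X @ W_Q, X @ W_K, X @ W_V)
# The query projection is now the identity: queries are X_tilde itself.
reduced = attention(X_tilde, X_tilde @ W_K_tilde, X_tilde @ W_V_tilde)

max_abs_diff = float(np.abs(original - reduced).max())
print("max abs difference:", max_abs_diff)
```

The two outputs should agree up to floating-point rounding. This only illustrates the isolated attention map; handling skip connections, normalization and MLPs is what the paper's later results address.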

Figures (4)

  • Figure 1: Per-sample relative L2 error distributions for approximating basis-transformed skip-connected MLPs. The trained GELU MLP (blue) achieves 4--5% relative error across dimensions $h \in \{256, 512, 768\}$, substantially outperforming the optimal linear baseline (orange, 9--10%).
  • Figure 2: Mean per-sample cosine similarity between predicted and target outputs. The trained MLP achieves $0.999$ cosine similarity ($\approx 2.5^\circ$ angular error) while the linear baseline achieves $\approx 0.995$ ($\approx 5.7^\circ$ angular error), demonstrating near-perfect directional alignment.
  • Figure 3: Training and validation loss for tied Embedding/LMHead weights configuration. The reduced model (No $W_Q$, red) closely tracks the standard baseline (blue) throughout training, achieving comparable final performance with fewer parameters.
  • Figure 4: Training and validation loss for untied weights configuration. Both models converge smoothly, with the reduced variant (No $W_Q$, blue) achieving slightly better final validation loss than the standard model (red).

Theorems & Definitions (22)

  • Lemma 3.1: Reparametrization Lemma
  • proof
  • Proposition 4.1
  • proof
  • Remark 1
  • Proposition 4.2
  • Theorem 4.1: Attention-Skip-Only Query Weight Elimination
  • Theorem 4.2: Weight-Shared Query Elimination
  • Remark 2
  • Lemma 8.1
  • ...and 12 more