KV Shifting Attention Enhances Language Modeling

Mingyu Xu, Wei Cheng, Bingning Wang, Weipeng Chen

TL;DR

The paper addresses the induction-head bottleneck in decoder-only transformers by introducing KV shifting attention, which decouples keys and values and adds a small set of learnable scalars per head. The authors prove that a one-layer KV-shifted head can emulate induction-head behavior and show theoretically that this reduces the depth and width requirements compared to traditional multi-layer setups. Empirically, KV shifting accelerates learning and improves performance from toy models to large pretraining scales (2.9B and 19B), with favorable scaling laws and robust results under varying seeds and hyperparameters. The approach also remains compatible with existing training and inference pipelines and offers potential benefits for mechanistic interpretability. Overall, KV shifting attention provides a lightweight, scalable enhancement to language modeling by making induction heads easier to learn, with practical implications for efficient large-scale pretraining and downstream reasoning tasks.

Abstract

Current large language models are mainly based on decoder-only transformers, which have strong in-context learning (ICL) capabilities. It is generally believed that an important foundation of this ICL capability is the induction heads mechanism, which requires at least two layers of attention. To implement the model's induction ability more efficiently, we revisit the induction heads mechanism and propose KV shifting attention. We theoretically prove that KV shifting attention reduces the model's depth and width requirements for the induction heads mechanism. Our experimental results demonstrate that KV shifting attention is beneficial to learning induction heads and language modeling, leading to better performance or faster convergence from toy models to pre-trained models with more than 10B parameters.
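The idea of a single KV-shifted head acting as an induction head can be illustrated with a toy sketch. It assumes that the shifted key and value at position i are the mixes alpha1 * k_i + alpha2 * k_(i-1) and beta1 * v_i + beta2 * v_(i-1), with alpha2 = 1 - alpha1 and beta2 = 1 - beta1 as in the Figure 2 caption. The one-hot embeddings, identity projections, `sharpness` constant, and all function names are hypothetical choices made for illustration and are not the authors' implementation.

```python
import math

def shift_mix(xs, w_cur, w_prev):
    """Return w_cur * x[i] + w_prev * x[i-1] for every position i.
    Position 0 has no predecessor, so a zero vector stands in for x[-1]."""
    zero = [0.0] * len(xs[0])
    out = []
    for i, x in enumerate(xs):
        prev = xs[i - 1] if i > 0 else zero
        out.append([w_cur * c + w_prev * p for c, p in zip(x, prev)])
    return out

def one_hot(token, vocab_size):
    return [1.0 if j == token else 0.0 for j in range(vocab_size)]

def kv_shift_head(tokens, vocab_size, alpha1, beta1, sharpness=20.0):
    """Toy single attention head (assumed setup: one-hot embeddings,
    identity Q/K/V projections). Returns the output at the last position,
    which may attend to every position, so no explicit causal mask is needed."""
    emb = [one_hot(t, vocab_size) for t in tokens]
    keys = shift_mix(emb, alpha1, 1.0 - alpha1)    # alpha2 = 1 - alpha1
    values = shift_mix(emb, beta1, 1.0 - beta1)    # beta2 = 1 - beta1
    query = emb[-1]
    scores = [sharpness * sum(q * k for q, k in zip(query, key)) for key in keys]
    top = max(scores)
    weights = [math.exp(s - top) for s in scores]  # numerically stable softmax
    total = sum(weights)
    return [sum(w / total * v[j] for w, v in zip(weights, values))
            for j in range(vocab_size)]

def predict(tokens, vocab_size, alpha1, beta1):
    """Index of the largest output coordinate, read as the predicted token."""
    out = kv_shift_head(tokens, vocab_size, alpha1, beta1)
    return max(range(vocab_size), key=lambda j: out[j])

sequence = [3, 1, 4, 2, 0, 1]            # token 1 was previously followed by 4
print(predict(sequence, 5, 0.0, 1.0))    # induction setting (alpha1, beta1) = (0, 1) -> 4
print(predict(sequence, 5, 1.0, 1.0))    # unshifted head in this toy setup -> 1
```

With (alpha1, beta1) = (0, 1), each key carries the previous token while each value carries the current one, so a query on token 1 lands on the position right after the earlier 1 and reads out 4. With no shift, the same toy head only matches earlier copies of the current token and returns 1.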

Paper Structure

This paper contains 45 sections, 3 theorems, 14 equations, 10 figures, 7 tables.

Key Result

Theorem 1

(Modified from wang2024transformers.) There exist a constant $C > 0$ and a two-layer single-head transformer $\text{TF}$ (without FFNs), with $D = 2d$, $W_K^{(1,1)} = W_Q^{(1,1)} = 0$, $p^{(2)} = m$ (where $p^{(i)}$ denotes the ALiBi bias in the $i$-th layer), and $\|W_K^{(2,1)}\|, \|W_Q^{(2,1)}\| \leq O(1,1/

Figures (10)

  • Figure 1: Left: induction accuracy of different models as the number of training steps increases. In this setting, the only difference between Vanilla and KV shifting attention is how the keys and values are computed. One-layer Vanilla and one-layer KV shifting attention have the same total number of parameters, while two-layer Vanilla has twice as many. Right: induction accuracy for different hidden sizes. The Vanilla model has two layers and the KV shifting attention model has one, so the Vanilla model has twice as many parameters.
  • Figure 2: Contour lines and gradient descent direction of $L$. We simplify $O(T)$ to a constant and set $\alpha_2=1- \alpha_1$ and $\beta_2=1- \beta_1$. Induction heads correspond to $(\alpha_1,\beta_1) = (0,1)$.
  • Figure 3: Accuracy of learning 3-gram text using models of different sizes. In this experiment, there is a 50M-parameter model with 4 layers, a 0.4M-parameter model with 2 layers, and a 0.8K-parameter model with 1 layer.
  • Figure 4: Training loss curves. We train the 2.9B model on 500B tokens and the 19B model on 200B tokens.
  • Figure 5: Training loss of the 1.5B-parameter model across random seeds and learning rates (LR).
  • ...and 5 more figures

Theorems & Definitions (4)

  • Definition 1
  • Theorem 1
  • Theorem 2
  • Theorem 3