Massive Values in Self-Attention Modules are the Key to Contextual Knowledge Understanding

Mingyu Jin, Kai Mei, Wujiang Xu, Mingjie Sun, Ruixiang Tang, Mengnan Du, Zirui Liu, Yongfeng Zhang

TL;DR

The paper investigates why extremely large activations, or massive values, emerge specifically in the Q and K components of self-attention in RoPE-enabled LLMs. It demonstrates that these values underpin contextual knowledge (CK) understanding rather than parametric knowledge (PK) retrieval: disrupting them disproportionately harms CK tasks while PK performance remains relatively stable. The authors show that RoPE drives the concentration of these values from the earliest layers and that quantization methods protecting massive values better preserve CK performance. These findings offer practical guidance for model design and optimization, including targeted quantization strategies and a deeper understanding of RoPE’s role in attention behavior.

Abstract

Large language models (LLMs) have achieved remarkable success in contextual knowledge understanding. In this paper, we show that concentrated massive values consistently emerge in specific regions of attention queries (Q) and keys (K), while no such pattern appears in values (V), across various modern transformer-based LLMs (Q, K, and V denote the representations output by the query, key, and value layers, respectively). Through extensive experiments, we further demonstrate that these massive values play a critical role in interpreting contextual knowledge (knowledge obtained from the current context window) rather than in retrieving parametric knowledge stored within the model's parameters. Our further investigation of quantization strategies reveals that ignoring these massive values leads to a pronounced drop in performance on tasks requiring rich contextual understanding, which aligns with our analysis. Finally, we trace the emergence of concentrated massive values and find that the concentration is caused by Rotary Positional Encoding (RoPE) and is present from the first layers onward. These findings shed new light on how Q and K operate in LLMs and offer practical insights for model design and optimization. The code is available at https://github.com/MingyuJ666/Rope_with_LLM.
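To make the idea of massive values concentrated in specific dimensions of Q more concrete, the following is a minimal sketch on synthetic data, not the authors' code. It collapses the sequence axis into a per-head, per-dimension norm and flags dimensions whose norm is far above the head's average. The function names `norm_map` and `find_massive` and the threshold `ratio=5.0` are assumptions made for illustration; the paper's own definition and threshold may differ.

```python
import math
import random


def norm_map(q):
    """Collapse the sequence axis: q[t][h][d] -> norms[h][d] (L2 norm over tokens)."""
    seq_len, num_heads, head_dim = len(q), len(q[0]), len(q[0][0])
    return [
        [math.sqrt(sum(q[t][h][d] ** 2 for t in range(seq_len))) for d in range(head_dim)]
        for h in range(num_heads)
    ]


def find_massive(norms, ratio=5.0):
    """Return (head, dim) pairs whose norm exceeds `ratio` times that head's mean norm.

    `ratio` is a hypothetical threshold chosen for this illustration.
    """
    flagged = []
    for h, row in enumerate(norms):
        mean = sum(row) / len(row)
        flagged.extend((h, d) for d, v in enumerate(row) if v > ratio * mean)
    return flagged


# Synthetic stand-in for a query tensor: 16 tokens, 4 heads, head_dim 8.
rng = random.Random(0)
q = [[[rng.gauss(0.0, 1.0) for _ in range(8)] for _ in range(4)] for _ in range(16)]
# Plant one large-magnitude dimension in head 2 so there is something to detect.
for t in range(16):
    q[t][2][6] = 50.0

massive = find_massive(norm_map(q))
print(massive)
```

On a real model, `q` would be replaced by the query (or key) states captured from one attention layer; the same map computed on the value states would, according to the abstract, show no comparable concentration.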

Paper Structure

This paper contains 30 sections, 16 equations, 46 figures, 11 tables.

Figures (46)

  • Figure 1: In transformer-based Large Language Models with RoPE (like Llama, Gemma), the attention queries (Q) and keys (K) exhibit concentrated massive values in certain dimensions.
  • Figure 2: Q and K embedding vectors in Llama-2-7B at Layers 10 and 20; the input question is the one shown in the prompt figure (fig:prompt_in_inference_LLM). The visualization is two-dimensional because the values are averaged over the sequence-length dimension. The horizontal axis is the head index and the vertical axis is the head dimension. The massive values are concentrated at the bottom of the image.
  • Figure 3: Disrupting massive values leads to higher perplexity and lower diversity, while disrupting non-massive values maintains model performance, particularly evident in IMDB dataset analysis.
  • Figure 4: Introducing conflicting background knowledge misleads the LLM into making random guesses. However, after massive values are disrupted, the model still maintains a certain level of accuracy.
  • Figure 5: Impacts of different quantization methods on Llama3-8b across different benchmarks.
  • ...and 41 more figures

Theorems & Definitions (4)

  • Definition 1
  • Definition 2
  • Remark 3.1
  • Remark 3.2