Decoupling Positional and Symbolic Attention Behavior in Transformers

Felipe Urrutia, Jorge Salas, Alexander Kozachinskiy, Cristian Buc Calderon, Hector Pasten, Cristobal Rojas

TL;DR

The paper addresses how Transformer attention mediated by RoPE can separately encode positional and symbolic information. It defines formal, mutually exclusive positional and symbolic head behaviors, introduces a metric to map heads onto a positional–symbolic plane, and demonstrates that real models strongly align with RoPE frequency usage. Through canonical tasks and toy models, it shows that access to specific RoPE frequencies causally determines whether a head excels at positional or symbolic tasks, and that combining frequencies enables mixed tasks, with frequency gating providing a knob to control performance. The work highlights a fundamental tension between positional and symbolic processing and proposes a framework to analyze, visualize, and potentially engineer inductive biases in RoPE-equipped Transformers. These insights offer a principled path to optimize long-context and information-retrieval capabilities by frequency-aware design of RoPE-based attention heads.

Abstract

An important aspect subtending language understanding and production is the ability to independently encode positional and symbolic information of the words within a sentence. In Transformers, positional information is typically encoded using Positional Encodings (PEs). One such popular PE, namely Rotary PE (RoPE), has been widely used due to its empirical success. Recently, it has been argued that part of RoPE's success emerges from its ability to encode robust positional and semantic information using large and small frequencies, respectively. In this work, we perform a deeper dive into the positional versus symbolic dichotomy of attention-head behavior, at both the theoretical and the empirical level. We provide general definitions of what it means for a head to behave positionally or symbolically, prove that these two behaviors are mutually exclusive, and develop a metric to quantify them. We apply our framework to analyze Transformer-based LLMs using RoPE and find that all heads exhibit a strong correspondence between behavior and frequency use. Finally, we introduce canonical tasks designed to be either purely positional or purely symbolic, and demonstrate that Transformer performance causally relates to the ability of attention heads to leverage the appropriate frequencies. In particular, we show that we can control Transformer performance by controlling which frequencies the attention heads can access. Altogether, our work provides a detailed understanding of RoPE and of how its properties relate to model behavior.
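To make the notion of "frequency access" concrete, the following is a minimal sketch, not the authors' implementation, of how a RoPE attention logit splits into one contribution per frequency and how a gate could switch individual frequencies on or off. It assumes the standard RoPE schedule $\theta_k = \text{base}^{-2k/d}$ with interleaved coordinate pairs; the function names and the `gate` vector are hypothetical and introduced only for illustration.

```python
import numpy as np

def rope_frequencies(head_dim, base=10000.0):
    # Assumed standard RoPE schedule: frequency ID 0 has the largest angular frequency.
    ids = np.arange(head_dim // 2)
    return base ** (-2.0 * ids / head_dim)

def per_frequency_logits(q, k, m, n, base=10000.0):
    # Treat each interleaved coordinate pair as one complex number (an assumed layout).
    theta = rope_frequencies(q.shape[0], base)
    qc = q[0::2] + 1j * q[1::2]
    kc = k[0::2] + 1j * k[1::2]
    # Rotating q by m*theta and k by n*theta leaves a dependence on (m - n) only.
    return np.real(qc * np.conj(kc) * np.exp(1j * (m - n) * theta))

def gated_logit(q, k, m, n, gate, base=10000.0):
    # gate is a hypothetical 0/1 (or soft) mask with one entry per frequency ID.
    return float(np.sum(gate * per_frequency_logits(q, k, m, n, base)))

rng = np.random.default_rng(0)
q, k = rng.normal(size=8), rng.normal(size=8)
all_on = np.ones(4)
only_highest = np.array([1.0, 0.0, 0.0, 0.0])  # keep only frequency ID 0
print(gated_logit(q, k, 7, 3, all_on), gated_logit(q, k, 7, 3, only_highest))
```

With every gate entry set to 1, the sum equals the usual RoPE logit between a query at position m and a key at position n. Zeroing entries restricts the head to a subset of frequencies, which is the kind of intervention the abstract describes, although the gating used in the paper's experiments may differ from this sketch.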

Paper Structure

This paper contains 53 sections, 15 theorems, 75 equations, 16 figures.

Key Result

Theorem 1

Let $H$ be an arbitrary attention head with the logit function $L$. Let $\bar{x}=(x_1,...,x_{n})$ be an input and let $\lambda=(\lambda_1,...,\lambda_{n-1})$ be the sequence of logits on this input, excluding $L(x_n,n,x_n,n)$; namely, $\lambda_j = L(x_{n},n,x_j,j).$ Then, denoting by $\mu$ the avera

Figures (16)

  • Figure 1: Global and local analysis of attention head behavior. A. Each head in the positional–symbolic plane. B. Heatmaps of positional and symbolic scores for each head across all layers. C. For the same heads, we plot their positional and symbolic scores as a function of RoPE frequencies. By convention, lower frequency IDs correspond to higher angular frequencies, and conversely, higher frequency IDs correspond to lower angular frequencies. D. Norms of the logits at each frequency for head (12:0). E. Attention weight patterns as a function of permutations. Top rows (green traces): Behavior of a symbolic head whose attention weight mass follows the permutations. Middle rows (red traces): Behavior of a positional head whose attention weight mass is invariant to the permutations. Bottom rows (blue traces): Behavior of a (mix) attention head with both high symbolic and positional scores, displaying a uniform mass with low attention weight scores. Note that symbolic, positional, and mix head behavior are associated with low, relatively large, and the largest frequencies, respectively. F. Location on the positional–symbolic plane of head (12:0) as a function of the selected frequency.
  • Figure 2: Performance on the canonical tasks across training iterations and epochs. A. Tension between Index (positional) and Retrieval (symbolic) tasks. B. Accuracy in the partial induction task for 1-frequency and 2-frequency models. C. Frequency IDs mapped to the log-scaled angle axis.
  • Figure 3: A. Accuracy curves for the Index (red) and Retrieval (green) tasks across epochs. B. Query (blue) and key (orange) vector trajectories during training for the Index task. C. Query/key vector projections onto a rotational plane for a gemma-2-2b-it head on a Binding Task input ($64$ pairs).
  • Figure 4: Illustration using the same sequence as in Definition \ref{def:act_pos_sym}. Left: Original logit values. Middle: Example of invariant logit values after a simple permutation (swapping $x_j$ with $x_k$). Right: Example of equivariant logit values under the same permutation.
  • Figure 5: Positional–symbolic profiles of all attention heads in google/gemma-2-2b-it. Each subplot corresponds to one attention head (26 layers, 8 heads per layer).
  • ...and 11 more figures
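
The invariant and equivariant cases in the Figure 4 caption can be illustrated with a small toy example. This is only a sketch under stated assumptions: the two logit functions below are hypothetical stand-ins (one depends only on the key position, the other only on the key token), and they are not the paper's formal definitions or its metric. The logit signature follows the $L(x_n, n, x_j, j)$ notation of Theorem 1.

```python
import numpy as np

def logits(L, x):
    # lambda_j = L(x_n, n, x_j, j) for j = 1..n-1, i.e. excluding the query's self-logit.
    n = len(x)
    return np.array([L(x[-1], n, x[j], j + 1) for j in range(n - 1)])

def positional_L(xq, n, xj, j):
    # Hypothetical purely positional head: depends on the key position only.
    return -0.5 * (n - j)

def symbolic_L(xq, n, xj, j):
    # Hypothetical purely symbolic head: depends on the key token only.
    return 0.3 * float(xj)

def swap(seq, j, k):
    # Return a copy of seq with entries j and k (0-based) exchanged.
    out = list(seq)
    out[j], out[k] = out[k], out[j]
    return out

x = [3, 1, 4, 5, 9]            # toy token IDs; the last token is the query
x_swapped = swap(x, 0, 2)      # swap two context tokens, leave the query in place

lam_pos, lam_pos_sw = logits(positional_L, x), logits(positional_L, x_swapped)
lam_sym, lam_sym_sw = logits(symbolic_L, x), logits(symbolic_L, x_swapped)

print("positional head invariant: ", np.allclose(lam_pos, lam_pos_sw))
print("symbolic head equivariant: ", np.allclose(lam_sym_sw, swap(lam_sym, 0, 2)))
```

In this toy setting the positional head's logits stay in place when tokens are swapped, while the symbolic head's logits move with the tokens, matching the middle and right panels described in the caption.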

Theorems & Definitions (24)

  • Definition 1: Attention Head
  • Definition 2: Positional and Symbolic attention
  • Theorem 1: The positional-symbolic exclusion principle
  • Theorem 2
  • Theorem 3
  • Theorem 4
  • Theorem 5
  • Corollary 1
  • Theorem 6
  • Theorem 7
  • ...and 14 more