How much do contextualized representations encode long-range context?

Simeng Sun, Cheng-Ping Hsieh

Abstract

We analyze contextual representations in neural autoregressive language models, emphasizing long-range contexts that span several thousand tokens. Our methodology employs a perturbation setup and the metric Anisotropy-Calibrated Cosine Similarity (ACCS) to capture the degree of contextualization of long-range patterns from the perspective of representation geometry. We begin the analysis with a case study on standard decoder-only Transformers, demonstrating that models with similar perplexity can exhibit markedly different downstream task performance, which can be explained by the difference in contextualization of long-range content. Next, we extend the analysis to other models, covering recent novel architectural designs and various training configurations. The representation-level results illustrate a reduced capacity for high-complexity (i.e., less compressible) sequences across architectures, and show that fully recurrent models rely heavily on local context, whereas hybrid models more effectively encode the entire sequence structure. Finally, a preliminary analysis of the effect of model size and training configurations on the encoding of long-range context suggests potential directions for improving existing language models.

Paper Structure

This paper contains 37 sections, 6 equations, 9 figures, 8 tables.

Figures (9)

  • Figure 1: Layerwise evolution of contextualized representations. We evaluate two settings that differ primarily in their anisotropy (expected cosine similarity), with the synthetic setting showing highly correlated representations, and consequently high self-similarity, despite diverse synthetic patterns in the prefix. Regardless of the input, representations become increasingly contextualized by the long-range prefix, as shown by the decreasing trend in ACCS.
  • Figure 2: Relationship between suffix perplexity, downstream task performance, and ACCS. The same perplexity can be reached while representations are contextualized by distant context to varying degrees (as measured by ACCS) and while downstream task performance differs significantly.
  • Figure 3: Models exhibit increasingly anisotropic representations as prefixes become less compressible, i.e., as their compression rate (compressed prefix size / raw prefix size, using LZMA compression) increases.
  • Figure 4: We apply perturbations from the beginning of the prefix and gradually extend the right boundary towards the suffix tokens (relative boundary = 1.0). RoPE-based Transformers (dashed lines) display low ACCS when the majority or all of the prefix is perturbed, likely due to over-contextualization of noise in the prefix. Fully recurrent models (mLSTM, Mamba-2) and GPT with ALiBi show sudden drops in ACCS when nearby tokens are perturbed, indicating stronger reliance on short-range context while being minimally contextualized by the distant prefix (plateau on the left). In contrast, hybrid models show a continuous downward trend, indicating more effective contextualization of the entire prefix.
  • Figure 5: We evaluate models on synthetic sequences with fully controlled patterns that become increasingly recognizable as sequence length grows. All models show increased contextualization of regularities, though fully recurrent models need some accumulation of patterns (initial flat lines). Interestingly, the larger 70B model encodes fewer prefix patterns at shorter sequence lengths but catches up with smaller models at larger context lengths.
  • ...and 4 more figures
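
The compression rate named in the Figure 3 caption (compressed prefix size / raw prefix size, using LZMA) can be sketched with Python's standard lzma module. This is a minimal illustration under stated assumptions, not the authors' code: the function name compression_rate is hypothetical, the prefix is assumed to be measured as raw bytes, and the default LZMA settings are assumed since the caption does not specify them.

```python
import lzma
import os


def compression_rate(prefix: bytes) -> float:
    """Return compressed size / raw size (assumed byte-level measurement).

    A higher value means the prefix is less compressible.
    """
    if not prefix:
        raise ValueError("prefix must be non-empty")
    # Default lzma.compress settings; the source does not state which were used.
    return len(lzma.compress(prefix)) / len(prefix)


# Two illustrative prefixes of equal length (both hypothetical inputs).
repetitive = b"abcd" * 4096        # highly regular, very compressible
random_like = os.urandom(16384)    # random bytes, nearly incompressible

rate_repetitive = compression_rate(repetitive)
rate_random = compression_rate(random_like)
print(f"repetitive: {rate_repetitive:.3f}  random: {rate_random:.3f}")
```

Under this definition the repetitive prefix should yield a rate close to 0 and the random prefix a rate close to (or slightly above) 1, which matches the caption's reading that a high compression rate corresponds to a less compressible prefix.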