What Kinds of Tokens Benefit from Distant Text? An Analysis on Long Context Language Modeling

Yutong Hu; Quzhe Huang; Kangcheng Luo; Yansong Feng

TL;DR

The paper investigates which tokens gain predictive advantage from distant text in long-context LLMs by analyzing token-level perplexity reductions as context length grows from $K$ to $2K$ across three long-context models. It reveals that content words, especially nouns and adjectives, and the first token of a word benefit most, with N-gram occurrences in extended context strongly enhancing predictions; token priors from pretraining further modulate these effects. The study demonstrates that longer contexts lead to sharper, more confident distributions, contributing to overall perplexity decreases and implying potential overconfidence in long-context LLMs. These insights illuminate how distant text is utilized in long-context language modeling and offer guidance for building more reliable long-context systems.

Abstract

As the context length that large language models can handle continues to increase, these models demonstrate an enhanced ability to utilize distant information for tasks such as language modeling. This capability contrasts with human reading and writing habits, where it is uncommon to remember and use particularly distant information, except in cases of foreshadowing. In this paper, we aim to explore which kinds of words benefit more from long contexts in language models. By analyzing the changes in token probabilities with increasing context length, we find that content words (e.g., nouns, adjectives) and the initial tokens of words benefit the most. Frequent patterns in the context (N-grams) also significantly impact predictions. Additionally, the model's prior knowledge plays a crucial role in influencing predictions, especially for rare tokens. We also observe that language models become more confident with longer contexts, resulting in sharper probability distributions. This overconfidence may contribute to the increasing probabilities of tokens with distant contextual information. We hope that our analysis will help the community better understand long-text language modeling and contribute to the design of more reliable long-context models.
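The token-level comparison described above can be sketched in a few lines. This is a minimal illustration, not the authors' implementation: it assumes that the benefit of a longer context is measured as the drop in a token's negative log-likelihood between the short and the long context (the paper's exact token-perplexity definition is not reproduced in this summary), and it uses Shannon entropy and the maximum probability as standard sharpness measures. The function names and all probability values are hypothetical.

```python
import math


def token_nll(prob: float) -> float:
    """Negative log-likelihood of one token given its predicted probability."""
    return -math.log(prob)


def nll_decrements(probs_short, probs_long):
    """NLL under the short context minus NLL under the long context, per token.

    A positive value means the token became more likely once the context
    was extended. This measure is an assumption of the sketch.
    """
    if len(probs_short) != len(probs_long):
        raise ValueError("both runs must score the same tokens")
    return [token_nll(ps) - token_nll(pl)
            for ps, pl in zip(probs_short, probs_long)]


def entropy(dist):
    """Shannon entropy (in nats) of a next-token distribution."""
    return -sum(p * math.log(p) for p in dist if p > 0)


# Hypothetical probabilities of the same four target tokens,
# scored once with context length K and once with 2K.
probs_k = [0.10, 0.50, 0.02, 0.30]
probs_2k = [0.40, 0.50, 0.01, 0.60]

deltas = nll_decrements(probs_k, probs_2k)
benefit_ratio = sum(d > 0 for d in deltas) / len(deltas)

# Hypothetical next-token distributions over a four-word vocabulary.
flat = [0.25, 0.25, 0.25, 0.25]
sharp = [0.85, 0.05, 0.05, 0.05]

print("per-token decrements:", [round(d, 3) for d in deltas])
print("share of tokens that benefit:", benefit_ratio)
print("entropy flat vs. sharp:", round(entropy(flat), 3), round(entropy(sharp), 3))
print("max probability flat vs. sharp:", max(flat), max(sharp))
```

In a real analysis the probabilities would come from two forward passes of the same model over the same target tokens, and the decrements would then be grouped by token property (for example POS tag or position within a word) before averaging.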

Paper Structure (25 sections, 13 equations, 6 figures, 5 tables)

Introduction
Preliminary
Perplexity
Perplexity Decreases as Context Length Increases
Experimental Setup
Models
Dataset
Setup
Most Tokens' Token-perplexity Decrease
What Tokens Benefit from Distant Text
Properties of Words
Lexical Property.
Structures inside Words
Influence of Context
Effect of N-gram's Occurrence.
...and 10 more sections

Figures (6)

Figure 1: Left part: an illustration for sliding window method of perplexity calculation. Right part: an illustration of original context and new context.
Figure 2: The average token-perplexity decrement in each class of POS tags.
Figure 3: $\Delta D$ of each class of POS tags.
Figure 4: Correlation coefficients between the token-perplexity decrement $\Delta \bar{p}$ and the N-gram's new occurrence ratio $\Delta \mathcal{N}$ under different values of N.
Figure 5: The entropy $E_K$ and the max probability $MP$ of groups T and F respectively.
...and 1 more figure
