Explaining Context Length Scaling and Bounds for Language Models

Jingzhe Shi, Qinwei Ma, Hongyi Liu, Hang Zhao, Jeng-Neng Hwang, Lei Li

TL;DR

This work proposes a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective, and conducts experiments on natural language and synthetic data, validating the proposed theoretical assumptions and deductions.

Abstract

Long Context Language Models have drawn great attention in the past few years. There has been work discussing the impact of long context on Language Model performance: some find that long irrelevant context could harm performance, while some experimentally summarize loss reduction by relevant long context as Scaling Laws. This calls for a more thorough understanding of how long context impacts Language Modeling. In this work, we (1) propose a clean and effective theoretical framework for explaining the impact of context length on Language Modeling, from an Intrinsic Space perspective; and (2) conduct experiments on natural language and synthetic data, validating our proposed theoretical assumptions and deductions. Our theoretical framework can provide practical insights, such as establishing that training dataset size dictates an optimal context length and bounds context length scaling for certain cases. We hope our work may inspire new long context Language Models, as well as future work studying Physics for Language Models. Code for our experiments is available at: https://github.com/JingzheShi/NLPCtlScalingAndBounds.

Paper Structure

This paper contains 56 sections, 4 theorems, 48 equations, 13 figures, 2 tables.

Key Result

Theorem 1

Let $\mathcal{Z} \subseteq \mathbb R^d$ ($d\ge 1$), and suppose there exists a non-empty open set $U\subseteq \mathbb R^d$ such that $U\subseteq \mathcal{Z}$ (i.e., $\mathcal{Z}$ is a $d$-dimensional region). Let $q:\mathcal{Z}\to[0,\infty)$ be a probability density satisfying [...]. Draw i.i.d. samples $\mathcal{Z}_{D}=\{Z_{1},\dots,Z_{D}\}\sim q^{\otimes D}$ and define the capped nearest–neighbour distance [...]. Then, th

Figures (13)

  • Figure 1: Left: Validation Loss Gap vs. Context Length, measured on subsets of the OpenWebText dataset, where we subtract the minimum loss grouped by context length from each curve (please refer to Figure \ref{fig: exp on optimal context and dataset size} for the original figure). For each training dataset size, there exists an optimal context length that minimizes pretraining validation loss, and this optimal length increases with the dataset size (more details can be found in Section \ref{sec: Optimal Context Length on LMs}). Right: similar results obtained on our Synthetic Dataset in Section \ref{subsection: synthetic dataset point 3}. This validates the deduction of our theory.
  • Figure 2: Bayes Risk vs. Context Length: Bayes Risk is approximated by the Cross Entropy loss measured with the LLaMa-3.1 series on OpenWebText, for different context lengths.
  • Figure 3: Left figures: Relative Eigenvalue for LLaMa-3.1-8B on a subset of OpenWebText, presented in different x-axis scales, with different context lengths visible to the Language Model. Gray lines represent the different thresholds we take to measure the intrinsic dimension (ID) of the current model. Right figures: Cross Entropy Loss vs. Measured Intrinsic Dimension. Each line represents a certain threshold used to measure ID in the intrinsic space of the LLM used. Different measurements give ID values that are linear w.r.t. each other, and they are all linear w.r.t. CE loss.
  • Figure 4: Left: An example of the 'two needles in a haystack' task, similar to those in \cite{sametaskmoretoken}. The text part is the input to the Language Model, with key information and question visualized in blue; the figure part shows perplexity of the answer token $\langle 8 \rangle$ of LLaMa-3.1-8B (horizontal) vs. number of masked leftmost tokens (vertical). Although seeing both pieces of information is necessary to answer the question, perplexity rises dramatically only when the first piece of information is masked. Right: An example of our synthetic data. The answers for Subtasks 1, 2, and 3 are $0\oplus0=0$, $0\oplus1=1$, and $1\oplus1=0$, respectively, but since the third of the control bits is $1$, only Subtask 3 is activated and the final answer is $0$. However, a model with context length $7$ cannot see the $9$th bit required by Subtask 3, so it is unable to predict the answer correctly.
  • Figure 5: Model trained on the proposed synthetic dataset; $\oplus$ represents feature concatenation. Only the first $l$ bits are used as input to the context MLP when the context length is set to $l$. We conduct PCA on the Context Feature to analyze the intrinsic dimension of the input context bits for various context lengths.
  • ...and 8 more figures
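
The captions of Figures 3 and 5 describe measuring intrinsic dimension by applying PCA to model features and counting eigenvalues above a threshold. The following is a minimal sketch of that idea, not the authors' implementation: the function names, the normalization by the largest eigenvalue, the threshold value, and the toy data are all assumptions made for illustration.

```python
import numpy as np

def relative_eigenvalues(features):
    """Covariance eigenvalues, sorted descending and divided by the largest one.

    `features` is an (n_samples, n_dims) array (hypothetical stand-in for
    hidden states or context features).
    """
    centered = features - features.mean(axis=0, keepdims=True)
    cov = centered.T @ centered / (features.shape[0] - 1)
    eigvals = np.linalg.eigvalsh(cov)[::-1]   # eigvalsh returns ascending order
    eigvals = np.clip(eigvals, 0.0, None)     # guard against tiny negative values
    return eigvals / eigvals[0]

def intrinsic_dimension(features, threshold):
    """Number of relative eigenvalues above `threshold` (an assumed ID measure)."""
    return int(np.sum(relative_eigenvalues(features) > threshold))

# Toy data: a 3-dimensional latent signal embedded in 16 dimensions plus small noise.
rng = np.random.default_rng(0)
latent = rng.normal(size=(2000, 3))
mixing = rng.normal(size=(3, 16))
features = latent @ mixing + 1e-3 * rng.normal(size=(2000, 16))

id_estimate = intrinsic_dimension(features, threshold=1e-2)
print("estimated intrinsic dimension:", id_estimate)
```

Different thresholds generally give different counts on real features, which is why the Figure 3 caption reports several thresholds and compares the resulting ID values against each other.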

Theorems & Definitions (8)

  • Theorem 1: Expected capped nearest–neighbour distance
  • proof
  • Theorem 2: Data Scaling for Approximation Loss
  • proof
  • Theorem 3: Bayes Risk and Context Length with Intrinsic Dimension Assumption
  • proof
  • Theorem 4: Bayes Risk and Context Length with Information Entropy Assumption
  • proof
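
Theorem 1 concerns the expected capped nearest–neighbour distance of i.i.d. samples in a $d$-dimensional region. Since the exact statement is truncated above, the sketch below only illustrates the general phenomenon under stated assumptions: a uniform density on the unit cube, query points drawn from the same density, and a cap applied as `min(distance, cap)`. None of these choices is taken from the paper. For a uniform density, the typical nearest-neighbour distance is known to shrink roughly like $D^{-1/d}$ as the sample size $D$ grows.

```python
import numpy as np

def mean_capped_nn_distance(num_samples, dim, cap=1.0, num_queries=400, seed=0):
    """Monte Carlo estimate of E[min(nearest-neighbour distance, cap)].

    Assumptions (illustrative only): samples and queries are uniform on
    [0, 1]^dim, and distances are Euclidean.
    """
    rng = np.random.default_rng(seed)
    data = rng.uniform(size=(num_samples, dim))
    queries = rng.uniform(size=(num_queries, dim))
    # Pairwise Euclidean distances between every query and every sample.
    diffs = queries[:, None, :] - data[None, :, :]
    dists = np.sqrt((diffs ** 2).sum(axis=-1))
    nearest = dists.min(axis=1)
    return float(np.minimum(nearest, cap).mean())

# In d = 2, multiplying the sample size by 16 should shrink the distance by about 4x.
for num_samples in (250, 1000, 4000):
    estimate = mean_capped_nn_distance(num_samples, dim=2)
    print(num_samples, round(estimate, 4))
```

Repeating the loop with a larger `dim` shows the distance shrinking much more slowly with sample size, which is the usual intuition for why the dimension of the space matters for data scaling.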