Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data

Haoran Deng, Yingyu Lin, Zhenghao Lin, Xiao Liu, Yizhou Sun, Yi-An Ma, Yeyun Gong

TL;DR

This paper addresses the observation that much of the available long-text data lacks meaningful long-range dependencies, which makes long-context pretraining inefficient. It introduces LongFilter, a data-selection framework that quantifies the information gain from extended context by comparing next-token predictions under long and short contexts using a token-level surrogate of the KL divergence. Using LongFilter to curate long-text data, the authors report substantial improvements on long-context benchmarks (HELMET, LongBench, RULER) for a LLaMA-3-8B model extended from 8K to 64K context, along with data-efficiency gains such as reaching 90+ on HELMET recall with far fewer training tokens. The work presents data quality, rather than architectural change, as a scalable route to long-range capability, and the authors release code for reproducibility and reuse in long-context pretraining pipelines.

Abstract

Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
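As a rough formalization of the contrast described above (a sketch, not the paper's exact definition): write $S_t$ for the short context preceding position $t$ and $E_t$ for the additional extended context, following the $S$/$E$ notation of Figure 1. The position index $t$, the vocabulary symbol $\mathcal{V}$, and the mean aggregation over $T$ tokens are assumptions made here for illustration. The paper scores tokens with a surrogate of the KL divergence whose exact form is not reproduced in this summary, so the full KL is shown instead.

```latex
\begin{align}
G_t &= D_{\mathrm{KL}}\big(p(\cdot \mid E_t, S_t)\,\|\,p(\cdot \mid S_t)\big)
     = \sum_{v \in \mathcal{V}} p(v \mid E_t, S_t)\,
       \log\frac{p(v \mid E_t, S_t)}{p(v \mid S_t)}, \\
\mathrm{score}(x_{1:T}) &= \frac{1}{T}\sum_{t=1}^{T} G_t .
\end{align}
```

$G_t$ is zero when the extended context leaves the next-token distribution unchanged and grows as the extended context shifts probability mass, so a document whose tokens are mostly predictable from local context would receive a low score under this sketch.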

Paper Structure

This paper contains 26 sections, 10 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: An illustration of the token-level long-context information gain. Given only the Short Context ($S$) "I hate this", the predictive distribution for the next token has high entropy, as many words ('song', 'thing', 'movie') are plausible. The Extended Context ($E$), "The plot was a mess...", provides critical information that reduces this entropy, concentrating the probability on "movie".
  • Figure 2: Workflow of LongFilter. The upper part computes the next-token probability distribution using a short-context sliding window (shown as 4 tokens for illustration, though our experiments use 4K), while the lower part computes it using the full long context. LongFilter then scores the information gain (middle part) by calculating a token-level surrogate KL divergence between these two distributions. This gain is low for locally predictable tokens (such as 'was') but high for tokens that require extended context (such as 'movie'). Finally, these token-level scores are aggregated to produce a single score for the entire data instance.
  • Figure 3: Performance on Recall tasks (Needle-in-a-Haystack) with respect to trained tokens.
  • Figure 4: Performance on HELMET with respect to trained tokens.
  • Figure 5: Token-level context score analysis on a subset of the processed SlimPajama-Arxiv dataset. In the top examples, the color of each token is determined by its score: the darker the color, the higher the score. (a) A high-scoring segment of well-formed academic prose from a PhD thesis. (b) A low-scoring segment containing non-prose LaTeX TikZ drawing commands. (c) Context Scores across the full token sequence for documents of the top three ranks and the bottom three ranks.
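
The scoring idea in the Figure 1 and Figure 2 captions can be illustrated with a small Python sketch. It is not the authors' implementation: it uses the exact KL divergence over a four-word toy vocabulary rather than the paper's surrogate, it aggregates by a plain mean (an assumption), and every probability below is invented for illustration. The function names `entropy`, `kl_divergence`, and `document_score` are hypothetical.

```python
import math


def entropy(p):
    """Shannon entropy (in nats) of a discrete distribution."""
    return -sum(q * math.log(q) for q in p if q > 0)


def kl_divergence(p_long, p_short, eps=1e-12):
    """KL(p_long || p_short) over a shared vocabulary, in nats."""
    return sum(
        pl * math.log(pl / max(ps, eps))
        for pl, ps in zip(p_long, p_short)
        if pl > 0
    )


def document_score(long_dists, short_dists):
    """Return (mean per-token gain, list of per-token gains).

    Mean aggregation is an assumption; the source only says that
    token-level scores are aggregated into a single score.
    """
    gains = [kl_divergence(pl, ps) for pl, ps in zip(long_dists, short_dists)]
    return sum(gains) / len(gains), gains


# Toy vocabulary order: ["song", "thing", "movie", "was"].
# All probabilities are made up to mirror the captions' example.
short_dists = [
    [0.02, 0.03, 0.05, 0.90],  # predicting 'was': local context suffices
    [0.30, 0.35, 0.30, 0.05],  # predicting 'movie': short context is ambiguous
]
long_dists = [
    [0.02, 0.03, 0.05, 0.90],  # extended context changes nothing here
    [0.03, 0.04, 0.90, 0.03],  # extended context concentrates mass on 'movie'
]

score, gains = document_score(long_dists, short_dists)
print("per-token gains:", gains)
print("document score:", score)
print("entropy short vs long at 'movie':",
      entropy(short_dists[1]), entropy(long_dists[1]))
```

In this toy setup the locally predictable position contributes no gain, while the position that needs the extended context dominates the document score. In a real pipeline the two distributions would come from the same language model run once with a truncated window and once with the full context.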