Beyond Length: Quantifying Long-Range Information for Long-Context LLM Pretraining Data

Haoran Deng; Yingyu Lin; Zhenghao Lin; Xiao Liu; Yizhou Sun; Yi-An Ma; Yeyun Gong

TL;DR

This paper addresses a problem in long-context pretraining: much long-text data lacks meaningful long-range dependencies, so training on it is inefficient. It introduces LongFilter, a data-selection framework that quantifies the information gain from extended context by comparing next-token predictions under long and short contexts and scoring them with a token-level surrogate KL divergence. Using LongFilter to curate long-text data, the authors report substantial improvements on long-context benchmarks (HELMET, LongBench, RULER) for a LLaMA-3-8B model extended from 8K to 64K context, along with data-efficiency gains such as reaching 90+ on HELMET recall with far fewer tokens. The work offers a scalable way to improve long-range capabilities through data quality rather than architectural changes, and the authors release code for reproducibility and reuse in long-context pretraining pipelines.

Abstract

Long-context language models unlock advanced capabilities in reasoning, code generation, and document summarization by leveraging dependencies across extended spans of text. However, a significant portion of readily available long-text data lacks meaningful long-distance dependencies; most spans can be predicted using only local context. Training on such data is inefficient, making careful data selection crucial. Therefore, we introduce LongFilter, a framework for curating training data tailored to long-context pretraining. LongFilter measures the information gain provided by extended context by contrasting model predictions under long-context versus short-context settings, thereby identifying samples where long-range dependencies are essential. Experiments with LLaMA-3-8B, extending its context length from 8K to 64K, show that LongFilter efficiently selects high-quality data and yields substantial improvements on benchmarks such as HELMET, LongBench, and RULER.
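To make the idea of contrasting long-context and short-context predictions concrete, the sketch below scores a sample by the average per-token KL divergence between the two next-token distributions and then keeps the highest-scoring samples. It is an illustration under stated assumptions, not the authors' implementation: the function names (`token_kl`, `long_context_gain`, `select_top`), the plain averaging over positions, and the `keep_fraction` cutoff are all hypothetical, and the distributions are assumed to have been produced already by running the same model once with the full context and once with a truncated one.

```python
import math


def token_kl(p_long, p_short, eps=1e-12):
    """KL(p_long || p_short) at one position.

    Both arguments are probability lists over the same vocabulary.
    eps guards against log(0); it is an assumption of this sketch.
    """
    return sum(
        pl * (math.log(pl + eps) - math.log(ps + eps))
        for pl, ps in zip(p_long, p_short)
        if pl > 0.0
    )


def long_context_gain(long_dists, short_dists):
    """Average per-token KL over all positions of one sample.

    A value near zero means the extra context barely changed the
    predictions, i.e. local context was enough for this sample.
    """
    if len(long_dists) != len(short_dists):
        raise ValueError("both settings must cover the same positions")
    if not long_dists:
        return 0.0
    total = sum(token_kl(pl, ps) for pl, ps in zip(long_dists, short_dists))
    return total / len(long_dists)


def select_top(scores, keep_fraction):
    """Return the ids of the highest-scoring samples.

    scores maps a sample id to its gain; keep_fraction is a
    hypothetical knob for how much of the corpus to retain.
    """
    k = math.ceil(len(scores) * keep_fraction)
    ranked = sorted(scores, key=scores.get, reverse=True)
    return ranked[:k]


# Toy example: two positions, a three-word vocabulary.
helped = long_context_gain(
    [[0.9, 0.05, 0.05], [0.8, 0.1, 0.1]],    # with long context
    [[0.4, 0.3, 0.3], [0.34, 0.33, 0.33]],   # with short context
)
unchanged = long_context_gain(
    [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]],
    [[0.5, 0.3, 0.2], [0.6, 0.2, 0.2]],
)
kept = select_top({"doc_a": helped, "doc_b": unchanged}, keep_fraction=0.5)
```

In this toy run `doc_a` is retained because its predictions shift when the longer context is available, while `doc_b` scores zero and is dropped. A real pipeline would obtain the distributions from a language model and would likely need a cheaper approximation than a full-vocabulary KL at every position, which is presumably what the token-level surrogate mentioned above is for.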
