
EntropyLong: Effective Long-Context Training via Predictive Uncertainty

Junlong Jia, Ziyang Chen, Xing Wu, Chaochen Gao, Zijia Lin, Debing Zhang, Songlin Hu, Binghui Guo

TL;DR

EntropyLong addresses the data bottleneck for long-context language models: it identifies high-entropy positions in a document, retrieves distant contexts through adaptive, uncertainty-guided retrieval, and empirically verifies that each retrieved context reduces predictive entropy. The four-stage pipeline (high-entropy position selection, information-theoretic retrieval, entropy-reduction verification, and strategic concatenation) creates training sequences with genuine long-range dependencies. Applied to FineWeb-Edu and Cosmopedia, it produces 128K-length training data with an average information gain of $\bar{\Delta I} = 0.68$. Empirical results show strong improvements on the RULER benchmark across context lengths and substantial gains on LongBench-v2 after instruction tuning, supported by ablations confirming the necessity of verification and the importance of threshold choices. The work introduces a principled, model-in-the-loop data curation approach that improves long-context understanding and provides a dataset to advance long-range reasoning in LLMs, with implications for scalable, information-theoretic pretraining.

Abstract

Training long-context language models to capture long-range dependencies requires specialized data construction. Current approaches, such as generic text concatenation or heuristic-based variants, frequently fail to guarantee genuine long-range dependencies. We propose EntropyLong, a novel data construction method that leverages predictive uncertainty to verify dependency quality. Our approach identifies high-entropy positions in documents, retrieves semantically relevant contexts from large corpora, and verifies their utility by assessing whether they reduce prediction entropy. This model-in-the-loop verification ensures each dependency represents measurable information gain rather than spurious correlation. We construct training samples with long-range dependencies by combining original documents with these verified contextual supplements. Using FineWeb-Edu and Cosmopedia, we generate a dataset of 128K-length sequences with verified dependencies. Models trained on this data demonstrate significant improvements on RULER benchmarks, particularly in tasks requiring distant information. Following instruction fine-tuning, our models also achieve substantial gains on LongBench-v2, demonstrating enhanced long-context understanding. Extensive ablation studies further validate the necessity and effectiveness of entropy-based verification for long-context training.
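
The selection and verification steps described in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the threshold form (mean plus `alpha` standard deviations), and the relative-reduction criterion with its default of 0.1 are all assumptions made for illustration. The sketch also takes next-token probability distributions as plain lists rather than calling a real language model.

```python
import math
from statistics import mean, stdev


def token_entropy(probs):
    """Shannon entropy (in nats) of one next-token probability distribution."""
    return -sum(p * math.log(p) for p in probs if p > 0.0)


def select_high_entropy_positions(entropies, alpha=1.0):
    """Return positions whose entropy exceeds an adaptive threshold.

    Assumption: the threshold is mean + alpha * (sample) standard deviation
    of the per-position entropies within the document.
    """
    threshold = mean(entropies) + alpha * stdev(entropies)
    return [i for i, h in enumerate(entropies) if h > threshold]


def is_verified(h_without, h_with, min_relative_reduction=0.1):
    """Keep a retrieved chunk only if it lowers entropy at the target position.

    h_without: entropy at the position given the original document only.
    h_with:    entropy at the same position with the retrieved chunk prepended.
    Assumption: the criterion is a minimum relative reduction.
    """
    if h_without <= 0.0:
        return False
    return (h_without - h_with) / h_without >= min_relative_reduction


# Per-position entropies for a toy five-token document.
entropies = [0.1, 0.2, 0.1, 2.0, 0.15]
uncertain = select_high_entropy_positions(entropies, alpha=1.0)

# Suppose a retrieved chunk halves the entropy at the uncertain position.
keep_chunk = is_verified(h_without=2.0, h_with=1.0)
```

In this toy run only the fourth position exceeds the threshold, and the chunk is kept because it halves the entropy there. In a real pipeline both entropies would come from forward passes of the same model, once with and once without the retrieved chunk in context.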

Paper Structure

This paper contains 34 sections, 8 equations, 4 figures, 9 tables, 1 algorithm.

Figures (4)

  • Figure 1: Overview of the EntropyLong framework. Step 1. Adaptive Threshold-Based High-Entropy Position Selection: Identify high-entropy tokens (red) exceeding the adaptive threshold and generate queries for uncertain positions. Step 2. Information-Theoretic Context Retrieval: Retrieve relevant document chunks from large corpora using query-based search. Step 3. Entropy Reduction Verification: Verify whether retrieved chunks reduce entropy; retain successful chunks (green checkmark) and discard ineffective ones. Step 4. Strategic Concatenation: Shuffle verified chunks and concatenate with the root document to create training sequences with validated dependencies.
  • Figure 2: EntropyLong's attention patterns analysis. (a) Attention to correct answers vs NExtLong across different context chunks with answers at front; (b) Relative attention vs NExtLong with answers at different positions, where $\Delta A$ represents the attention difference.
  • Figure 3: EntropyLong's performance on the needle-in-a-haystack task within a 128K context window. The heatmap shows accuracy across different text lengths and needle positions, with darker colors indicating higher accuracy. EntropyLong achieves perfect accuracy across all configurations, successfully locating the target information regardless of position within the context.
  • Figure 4: An example of contextual uncertainty resolution. Retrieved context about Neurath's philosophical stance (disputing universal truths and Cartesian rationality) enables the model to confidently predict "criticism" by connecting this background knowledge with the original document's reference to Neurath's position toward Descartes' views on science.
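
Step 4 in the Figure 1 caption (shuffle the verified chunks and concatenate them with the root document) might look roughly like the sketch below. The caption does not say where the chunks go or how the length budget is enforced, so three things here are assumptions: the chunks are placed before the root document, any chunk that does not fit the remaining budget is skipped, and `build_sequence` with its parameters is a hypothetical name.

```python
import random


def build_sequence(root_tokens, verified_chunks, max_len, seed=0):
    """Shuffle verified chunks and concatenate them with the root document.

    Assumptions (not stated in the source): chunks are placed before the
    root document, and chunks that would exceed max_len are skipped.
    """
    rng = random.Random(seed)
    chunks = [list(c) for c in verified_chunks]
    rng.shuffle(chunks)  # randomize the order of the supporting contexts

    budget = max_len - len(root_tokens)  # tokens left for retrieved context
    context = []
    for chunk in chunks:
        if len(chunk) > budget:
            continue  # skip chunks that no longer fit
        context.extend(chunk)
        budget -= len(chunk)

    return context + list(root_tokens)


# Toy token ids: a two-token root document and three verified chunks.
root = [100, 101]
chunks = [[1, 2], [3, 4, 5], [6]]
sequence = build_sequence(root, chunks, max_len=6)
```

Placing the root document last puts its high-entropy positions far from the evidence that resolves them, which is one plausible way to obtain the long-range dependencies the caption refers to.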