LADM: Long-context Training Data Selection with Attention-based Dependency Measurement for LLMs

Jianghao Chen, Junhong Wu, Yangyifan Xu, Jiajun Zhang

TL;DR

This paper tackles the challenge of measuring and selecting high-quality long-context data for continual pre-training of LLMs. It introduces LADM, an attention-based framework that quantifies dependencies across long-context spans via a TinyLong Attention Calculator, yielding a Contextual Dependency Score (CDS) to guide data selection. Across perplexity, synthetic long-context tasks, and real-world benchmarks, LADM consistently outperforms random sampling and delta-perplexity baselines, achieving notable gains with only 1B training tokens. The work also provides efficiency analyses and ablations, highlighting when and why selecting data by long-range dependency improves long-context modeling across diverse model families. This approach offers a practical path to better long-context performance while reducing training costs.

Abstract

Long-context modeling has drawn increasing attention in the area of Large Language Models (LLMs). Continual training with long-context data has become the de facto method for equipping LLMs with the ability to process long inputs. However, measuring the quality of long-context training data remains an open challenge. To address this issue, we propose a Long-context data selection framework with Attention-based Dependency Measurement (LADM), which efficiently identifies high-quality long-context data from a large-scale, multi-domain pre-training corpus. LADM leverages the retrieval capabilities of the attention mechanism to capture contextual dependencies, providing a comprehensive quality measurement of long-context data. Experimental results show that LADM significantly boosts the performance of LLMs on multiple long-context tasks with only 1B tokens of continual training.

Paper Structure

This paper contains 41 sections, 5 equations, 9 figures, 9 tables.

Figures (9)

  • Figure 1: The overall framework of LADM. We first train a Long Attention Calculator, then calculate the Pairwise Focus Score (PFS) to measure the dependency between spans. Then, we compute the Aggregated Focus Score (AFS) of each span and merge them as the Contextual Dependency Score (CDS) of a single long-context sample.
  • Figure 2: The median attention scores under different 32K data sample construction methods.
  • Figure 3: The "Needle-in-the-Haystack" task performance of different data selection methods. The x-axis denotes the evaluation context length, and the y-axis denotes the insertion depth of the "needle".
  • Figure 4: Median scores for various data categories from the Pile dataset under the ProLong and LADM frameworks.
  • Figure 5: Data length distribution of the Pile dataset.
  • ...and 4 more figures
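
The span-to-sample scoring flow described in the Figure 1 caption (pairwise scores between spans, aggregated per span, then merged into one score per sample) can be illustrated with a rough sketch. The formulas below are assumptions made for illustration only and are not the paper's definitions: the pairwise score is taken to be the mean attention mass that tokens in a later span place on an earlier span, the per-span score is the sum over earlier spans, and the sample score is the mean over spans that have a predecessor. The function name `dependency_scores` and the parameter `span_len` are hypothetical, and the attention matrix is assumed to be a single causal matrix (for example, averaged over heads).

```python
import numpy as np


def dependency_scores(attn, span_len):
    """Illustrative span-level dependency scoring (assumed formulas, not the paper's).

    attn: (T, T) causal attention matrix; rows are query tokens, columns are key tokens.
    span_len: number of tokens per span; trailing tokens that do not fill a span are ignored.
    Returns (pairwise, aggregated, sample_score).
    """
    attn = np.asarray(attn, dtype=float)
    n_spans = attn.shape[0] // span_len

    # Pairwise score (assumption): average, over query tokens in span i,
    # of the total attention they place on the keys of an earlier span j.
    pairwise = np.zeros((n_spans, n_spans))
    for i in range(n_spans):
        queries = slice(i * span_len, (i + 1) * span_len)
        for j in range(i):
            keys = slice(j * span_len, (j + 1) * span_len)
            pairwise[i, j] = attn[queries, keys].sum(axis=1).mean()

    # Per-span score (assumption): total focus of span i on all earlier spans.
    aggregated = pairwise.sum(axis=1)

    # Sample score (assumption): mean over spans that have at least one earlier span.
    sample_score = float(aggregated[1:].mean()) if n_spans > 1 else 0.0
    return pairwise, aggregated, sample_score


def uniform_causal_attention(n_tokens):
    """Toy attention: each token attends uniformly to itself and all earlier tokens."""
    mask = np.tril(np.ones((n_tokens, n_tokens)))
    return mask / mask.sum(axis=1, keepdims=True)


if __name__ == "__main__":
    pairwise, aggregated, score = dependency_scores(uniform_causal_attention(8), span_len=2)
    print(pairwise)
    print(aggregated)
    print(score)
```

Under these assumed definitions, a sample whose tokens attend only to their own span scores zero, while a sample whose later spans draw heavily on earlier ones scores higher, which is the kind of ordering a dependency-based selection step would use.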