LongAttn: Selecting Long-context Training Data via Token-level Attention
Longyun Wu, Dawei Zhu, Guangxiang Zhao, Zhuocheng Yu, Junfeng Ran, Xiangyu Wong, Lin Sun, Sujian Li
TL;DR
LongAttn addresses the problem of building long-context capabilities in LLMs by shifting data selection from sentence-level cues to token-level dependency signals derived from self-attention. It defines a token-level dependency strength $DS_T$ and a distribution uniformity $DU_T$, combines them into a long-distance dependency score $LDS_T$, and uses this score to filter high-quality long-context data, yielding the LongABC-32K dataset. In extensive experiments, LongAttn shows superior long-context retrieval, strong benchmark results at 32K context length, and better efficiency than sentence-level methods, and it scales to larger models. The approach enables more data-efficient long-context pre-training, and the data and code are released openly to accelerate future research in long-context modeling.
Abstract
With the development of large language models (LLMs), there is a growing need for significant advances in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with long-range dependencies is crucial. Existing methods for selecting long-context data often rely on sentence-level analysis, which leaves considerable room for improvement in both performance and efficiency. In this paper, we propose a novel token-level framework, LongAttn, which leverages the self-attention mechanism of LLMs to measure the long-range dependencies within the data. By calculating the token-level dependency strength and the distribution uniformity of token scores, LongAttn effectively quantifies long-range dependencies, enabling more accurate and efficient data selection. We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Comprehensive experiments demonstrate the effectiveness, scalability, and efficiency of LongAttn. To facilitate future research on long-context data, we release our code and the high-quality long-context training data LongABC-32K.
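To make the idea of scoring a document by token-level attention concrete, the following is a minimal sketch, not the paper's definitions. It assumes a single causal attention matrix whose rows sum to 1. The function name `long_range_scores`, the `min_distance` threshold, the use of normalized entropy as a uniformity measure, and the product used to combine the two terms are all illustrative assumptions; the exact formulas for $DS_T$, $DU_T$, and $LDS_T$ are those given in the method section of the paper.

```python
import numpy as np

def long_range_scores(attn, min_distance):
    """Illustrative long-range dependency scores for one attention matrix.

    attn: (T, T) causal attention, row i = query token i, rows sum to 1.
    min_distance: a key counts as "distant" if it lies at least this many
                  positions before the query (hypothetical threshold).
    Returns (strength, uniformity, combined), all assumed definitions.
    """
    attn = np.asarray(attn, dtype=float)
    T = attn.shape[0]
    n_far = T - min_distance  # number of queries that have any distant key
    if n_far <= 0:
        return 0.0, 0.0, 0.0

    q = np.arange(T)[:, None]
    k = np.arange(T)[None, :]
    far_attn = np.where((q - k) >= min_distance, attn, 0.0)

    # Strength: mean attention mass that eligible queries place on distant keys.
    strength = float(far_attn.sum() / n_far)

    # Token scores: distant attention received by each key that can be distant.
    token_scores = far_attn.sum(axis=0)[:n_far]
    total = token_scores.sum()
    if total <= 0.0 or n_far < 2:
        uniformity = 0.0
    else:
        p = token_scores / total
        nz = p[p > 0]
        entropy = -(nz * np.log(nz)).sum()
        # Normalized entropy in [0, 1]; 1 means perfectly even token scores.
        uniformity = float(entropy / np.log(n_far)) + 0.0

    return strength, uniformity, strength * uniformity

# Example: uniform causal attention over 8 tokens, distance threshold 4.
T = 8
uniform = np.tril(np.ones((T, T))) / np.arange(1, T + 1)[:, None]
print(long_range_scores(uniform, min_distance=4))
```

Under these assumptions, a document whose tokens attend only locally receives a score of zero, and one whose distant attention collapses onto a single token is penalized by the uniformity term, which mirrors the motivation for pairing strength with uniformity.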
