LongAttn: Selecting Long-context Training Data via Token-level Attention

Longyun Wu, Dawei Zhu, Guangxiang Zhao, Zhuocheng Yu, Junfeng Ran, Xiangyu Wong, Lin Sun, Sujian Li

TL;DR

LongAttn selects long-context training data for LLMs using token-level dependency signals derived from self-attention rather than sentence-level cues. It defines a token-level dependency strength $DS_T$ and a distribution uniformity $DU_T$, combines them into a long-distance dependency score $LDS_T$, and uses that score to filter high-quality long-context data, yielding the LongABC-32K dataset. In the reported experiments, models trained on LongAttn-selected data show better long-context retrieval and strong results on 32K-context benchmarks, and the method is more efficient than sentence-level selection while scaling with larger models. The approach enables more data-efficient long-context pre-training, and the authors release the data and code to support future research in long-context modeling.

Abstract

With the development of large language models (LLMs), there has been an increasing need for significant advancements in handling long contexts. To enhance long-context capabilities, constructing high-quality training data with long-range dependencies is crucial. Existing methods to select long-context data often rely on sentence-level analysis, which can be greatly optimized in both performance and efficiency. In this paper, we propose a novel token-level framework, LongAttn, which leverages the self-attention mechanism of LLMs to measure the long-range dependencies for the data. By calculating token-level dependency strength and distribution uniformity of token scores, LongAttn effectively quantifies long-range dependencies, enabling more accurate and efficient data selection. We filter LongABC-32K from open-source long-context datasets (ArXiv, Book, and Code). Through our comprehensive experiments, LongAttn has demonstrated its excellent effectiveness, scalability, and efficiency. To facilitate future research in long-context data, we released our code and the high-quality long-context training data LongABC-32K.
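The abstract describes scoring a document by a token-level dependency strength and by the uniformity of the per-token scores. The paper's exact definitions of $DS_T$, $DU_T$, and $LDS_T$ are not reproduced in this section, so the following is only an illustrative sketch of the general idea, not the authors' formulas. The function names, the `min_distance` threshold, the use of a standard deviation as the uniformity term, and the `alpha` weight are all assumptions made for this example.

```python
import numpy as np


def long_range_scores(attn, min_distance):
    """Illustrative (hypothetical) long-range statistics for one attention map.

    attn: (T, T) causal attention matrix; row i holds the weights that
          query token i places on key tokens j <= i, and each row sums to 1.
    min_distance: assumed threshold; a key counts as "long-range" when it
          lies at least this many positions before the query.
    Returns (strength, spread): the mean and the standard deviation of the
    per-token long-range attention mass.
    """
    attn = np.asarray(attn, dtype=float)
    T = attn.shape[0]
    if attn.shape != (T, T) or not 0 < min_distance < T:
        raise ValueError("need a square map and 0 < min_distance < T")

    idx = np.arange(T)
    # far[i, j] is True when key j is at least min_distance tokens before query i.
    far = (idx[:, None] - idx[None, :]) >= min_distance

    # Attention mass each query token places on distant keys.
    token_scores = (attn * far).sum(axis=1)

    # Only queries at position >= min_distance have any distant key at all.
    valid = token_scores[min_distance:]
    return valid.mean(), valid.std()


def combined_score(strength, spread, alpha=1.0):
    """Assumed combination: reward high strength, penalise uneven scores."""
    return strength - alpha * spread


# Usage: a toy 4-token map where every query attends uniformly to its prefix.
T = 4
toy = np.tril(np.ones((T, T)))
toy /= toy.sum(axis=1, keepdims=True)
strength, spread = long_range_scores(toy, min_distance=2)
score = combined_score(strength, spread)
```

In this sketch the spread term plays the role the text assigns to distribution uniformity: a document whose long-range attention comes from a handful of tokens gets a large spread and therefore a lower combined score than one whose tokens all attend far back to a similar degree.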

Paper Structure

This paper contains 37 sections, 6 equations, 3 figures, 11 tables, 1 algorithm.

Figures (3)

  • Figure 1: (a) How to measure long-range dependencies at the token level by using the self-attention mechanism. $DS_T$ indicates that the tokens in this data have strong long-distance dependencies, while $DU_T$ prevents negative impacts from individual tokens' high scores. (b) The comparison of long-context retrieval capabilities of models trained with different scales of tokens selected randomly, with sentence-level ProLong, and with LongAttn (ours).
  • Figure 2: LongAttn Framework: After preprocessing the data, the long-distance dependency strength at the token level is analyzed using the self-attention mechanism of an LLM. This analysis serves as the basis for filtering the data, which is then used for continual pre-training of a base model that initially lacks long-context capabilities, resulting in our LongAttn model.
  • Figure 3: (a) and (b) show the performance of other long-context LLMs and LongAttn-trained models on the RULER and complex NIAH tasks. (c) and (d) show the performance of models trained with different methods on the same tasks. Toge. and LLORA represent Together and LongLoRA, respectively. 5B-LA and 10B-LA represent models trained on 5B and 10B tokens selected by LongAttn. LA-8 and LA-70 represent LongAttn based on 8B and 70B models, respectively.