LongHeads: Multi-Head Attention is Secretly a Long Context Processor

Yi Lu, Xin Zhou, Wei He, Jun Zhao, Tao Ji, Tao Gui, Qi Zhang, Xuanjing Huang

TL;DR

LongHeads tackles the problem of processing long contexts in large language models by reinterpreting multi-head attention as a set of chunk-focused processors. It introduces a training-free, chunk-based mechanism that selects relevant chunks for each head and remaps positions to stay within the pretrained length, achieving linear-time complexity during inference. Empirical results on LLaMA-2 models show strong performance on language modeling, retrieval tasks, and long-context benchmarks, often matching or exceeding restricted-attention baselines and approaching full-attention methods with far lower cost. The approach highlights the potential of leveraging inherent attention structure to extend context without additional training, while also outlining limitations and avenues for further refinement.

Abstract

Large language models (LLMs) have achieved impressive performance in numerous domains but often struggle to process lengthy inputs effectively and efficiently due to limited length generalization and attention's quadratic computational demands. Many works have sought to mitigate this by restricting the attention window to within the pre-trained length. However, these methods introduce new issues such as ignoring the middle context and requiring additional training. To address these problems, we propose LongHeads, a training-free framework that enhances LLMs' long-context ability by unlocking multi-head attention's untapped potential. Instead of allowing each head to attend to the full sequence, which struggles to generalize to longer sequences due to out-of-distribution (OOD) issues, we allow each head to process an in-distribution length by selecting and attending to important context chunks. To this end, we propose a chunk selection strategy that relies on the inherent correlation between the query and the key representations, efficiently distributing context chunks to different heads. In this way, each head can effectively process its attended tokens within the trained length, while different heads in different layers collectively process longer contexts. LongHeads works efficiently in linear time and fits seamlessly with many LLMs that use relative positional encoding. LongHeads achieves 100% accuracy at the 128k length on the passkey retrieval task, verifying its efficacy in extending the usable context window for existing models. We release our code at https://github.com/LuLuLuyi/LongHeads .
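
As a rough illustration of the idea described in the abstract, the sketch below shows how a single head might pick a few chunks by query-key correlation and then renumber the attended tokens so that their positions stay inside the trained length. It is not the authors' implementation. The function names `select_chunks` and `remap_positions`, the use of a mean key vector as the chunk summary, and the simple contiguous renumbering are all assumptions made for illustration; the released repository should be consulted for the actual method.

```python
import numpy as np


def select_chunks(query, keys, chunk_size, k):
    """Pick k chunk indices for one head (hypothetical helper).

    The first and last chunks are always kept; the remaining slots go to
    the middle chunks whose summary vector scores highest against the query.
    """
    n_chunks = int(np.ceil(len(keys) / chunk_size))
    # Assumption: each chunk is summarized by the mean of its key vectors.
    chunk_reprs = np.stack([
        keys[i * chunk_size:(i + 1) * chunk_size].mean(axis=0)
        for i in range(n_chunks)
    ])
    scores = chunk_reprs @ query  # one dot-product score per chunk
    forced = {0, n_chunks - 1}
    ranked = [i for i in np.argsort(-scores).tolist() if i not in forced]
    chosen = forced | set(ranked[:max(k - len(forced), 0)])
    return sorted(chosen)


def remap_positions(chunks, chunk_size, seq_len):
    """Return (token_indices, positions) for the selected chunks.

    Assumption: attended tokens are renumbered 0..m-1 in their original
    order, so the largest position is bounded by k * chunk_size.
    """
    token_idx = np.concatenate([
        np.arange(c * chunk_size, min((c + 1) * chunk_size, seq_len))
        for c in chunks
    ])
    positions = np.arange(len(token_idx))
    return token_idx, positions


if __name__ == "__main__":
    rng = np.random.default_rng(0)
    keys = rng.normal(size=(4096, 64))   # cached keys for one head
    query = rng.normal(size=64)          # current query for that head
    chunks = select_chunks(query, keys, chunk_size=256, k=4)
    token_idx, positions = remap_positions(chunks, 256, len(keys))
    print(chunks, len(token_idx), positions.max())
```

Because each head scores chunks with its own query and keys, different heads can end up with different chunk sets, which is how the heads can jointly cover a context longer than any one of them attends to.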

Paper Structure

This paper contains 38 sections, 3 equations, 6 figures, and 5 tables.

Figures (6)

  • Figure 1: Left: Three types of long-context processors: (a) Attend all contexts but struggle with out-of-pre-trained length; (b) Attend local context to generate fluently but lose information; (c) Head attends short chunks and Heads attend Long context. Right: Accuracy of three specific methods on the passkey retrieval task.
  • Figure 2: An overview of LongHeads's inference, generating token $x_{14}$ in the current step. During inference, LongHeads keeps the first chunk for stable computation, combined with the last chunk containing recent tokens.
  • Figure 3: Demonstration of Position Remapping.
  • Figure 4: Evaluation of the passkey retrieval task at different context lengths. LongHeads achieves performance comparable to Landmark Attention and outperforms other methods.
  • Figure 5: Visualization of chunks selected by different attention heads at each layer represented by color blocks. For the passkey retrieval task, the chunk containing the passkey is delineated with a red border. For the failed example, the red border encompasses two chunks due to the passkey-containing sentence coincidentally spanning two chunks.
  • ...and 1 more figure