MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads

Weihao Liu, Ning Wu, Shiping Yang, Wenbiao Ding, Shining Liang, Ming Gong, Dongmei Zhang

TL;DR

This work tackles the problem of distracted attention in large language models when processing ultra-long, multi-document inputs. It introduces MuDAF, a head-level contrastive learning framework that optimizes query-key projections to bias attention toward relevant golden passages in MDQA tasks. Empirical results on long-context QA benchmarks show substantial improvements over vanilla fine-tuning and competitive performance relative to GPT-4o, while analyses reveal retrieval-head differences between MDQA and NIAH tests and highlight limits in scaling the number of targeted heads. The findings offer a practical approach to enhancing long-context reasoning by directly shaping attention distributions at the head level, with implications for retrieval-augmented generation and multi-document reasoning.

Abstract

Large Language Models (LLMs) frequently show distracted attention due to irrelevant information in the input, which severely impairs their long-context capabilities. Inspired by recent studies on the effectiveness of retrieval heads in long-context factuality, we aim to address this distraction issue by improving such retrieval heads directly. We propose Multi-Document Attention Focusing (MuDAF), a novel method that explicitly optimizes the attention distribution at the head level through contrastive learning. According to the experimental results, MuDAF can significantly improve the long-context question answering performance of LLMs, especially in multi-document question answering. Extensive evaluations on retrieval scores and attention visualizations show that MuDAF possesses great potential in making attention heads more focused on relevant information and reducing attention distractions.
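
To make the idea of head-level contrastive learning concrete, the following is a minimal sketch of an InfoNCE-style loss for a single attention head, not the authors' implementation. It assumes the question has already been reduced to one pooled query vector and each passage to one pooled key vector for that head; the function name `head_contrastive_loss`, the cosine normalization, the single golden passage, and the `temperature` value are all illustrative assumptions that do not come from the paper.

```python
import numpy as np


def head_contrastive_loss(q, passage_keys, golden_idx, temperature=0.1):
    """InfoNCE-style loss for one attention head (illustrative sketch).

    q            : (d,) pooled query vector of the question (assumed pooling).
    passage_keys : (n, d) pooled key vectors, one row per passage.
    golden_idx   : index of the passage treated as the positive example.
    temperature  : softmax temperature (hypothetical default).
    """
    # Normalize so the dot product below is a cosine similarity.
    q = q / np.linalg.norm(q)
    k = passage_keys / np.linalg.norm(passage_keys, axis=1, keepdims=True)

    # One similarity score per passage, scaled by the temperature.
    logits = (k @ q) / temperature

    # Numerically stable log-softmax over the passages.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())

    # Negative log-likelihood of the golden passage.
    return float(-log_probs[golden_idx])


# Toy example: the query points in the same direction as passage 0.
q = np.array([1.0, 0.0])
keys = np.array([[1.0, 0.0], [0.0, 1.0], [-1.0, 0.0]])
loss_when_golden_is_aligned = head_contrastive_loss(q, keys, golden_idx=0)
loss_when_golden_is_opposed = head_contrastive_loss(q, keys, golden_idx=2)
```

Under these assumptions the loss is small when the question's query vector is most similar to the golden passage's key vector and large otherwise, so minimizing it would push the head to score the relevant passage above the distractors.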

Paper Structure

This paper contains 37 sections, 17 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Given instructions, long documents and a specific question, LLMs can often be confused when facing information from multiple sources. Our method MuDAF helps LLMs focus on documents related to the given question. Deeper colors represent higher attention values.
  • Figure 2: An overview of our proposed method. The goal of MuDAF is to adjust the similarity between the Query features from the question and the Key features from the passages, thus making attention heads allocate more attention weight to relevant information and reducing distractions. CL means contrastive learning.
  • Figure 3: The F1 and EM retrieval scores for attention heads of Llama3.1-8B. We list the top 16 retrieval heads ranked by their F1 scores in the inner graph.
  • Figure 4: Average performance of Llama3.1-8B on LongBench with different masking strategies. In this experiment, we used the MDQA subset of LongBench, including HotpotQA, 2WikiMQA and MuSiQue. Masked retrieval heads were also randomly selected from the set of retrieval heads, and the final results were obtained by averaging three independent experimental runs.
  • Figure 5: Comparison of enhanced retrieval capabilities in selected attention heads. We annotate the rank of each attention head among all heads above the bar (i.e., #x).
  • ...and 6 more figures