MuDAF: Long-Context Multi-Document Attention Focusing through Contrastive Learning on Attention Heads
Weihao Liu, Ning Wu, Shiping Yang, Wenbiao Ding, Shining Liang, Ming Gong, Dongmei Zhang
TL;DR
This work tackles the problem of distracted attention in large language models when processing ultra-long, multi-document inputs. It introduces MuDAF, a head-level contrastive learning framework that optimizes query-key projections to bias attention toward the relevant golden passages in multi-document question answering (MDQA) tasks. Empirical results on long-context QA benchmarks show substantial improvements over vanilla fine-tuning and competitive performance relative to GPT-4o. Further analyses reveal that the retrieval heads active in MDQA differ from those active in needle-in-a-haystack (NIAH) tests, and highlight limits in scaling the number of targeted heads. The findings offer a practical approach to enhancing long-context reasoning by directly shaping attention distributions at the head level, with implications for retrieval-augmented generation and multi-document reasoning.
Abstract
Large Language Models (LLMs) frequently exhibit distracted attention caused by irrelevant information in the input, which severely impairs their long-context capabilities. Inspired by recent studies on the effectiveness of retrieval heads in long-context factuality, we aim to address this distraction issue by improving such retrieval heads directly. We propose Multi-Document Attention Focusing (MuDAF), a novel method that explicitly optimizes the attention distribution at the head level through contrastive learning. According to the experimental results, MuDAF can significantly improve the long-context question answering performance of LLMs, especially in multi-document question answering. Extensive evaluations on retrieval scores and attention visualizations show that MuDAF has great potential to make attention heads more focused on relevant information and to reduce attention distractions.
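
To make the idea of a head-level contrastive objective concrete, the following is a minimal sketch of a generic InfoNCE-style loss applied to one attention head. The abstract does not specify the loss, so everything here is an assumption for illustration: the function name `info_nce_head_loss`, the use of pooled per-passage key vectors, cosine similarity, and the `temperature` value are hypothetical choices and not necessarily the formulation used in the paper.

```python
import numpy as np

def info_nce_head_loss(query_vec, passage_keys, golden_idx, temperature=0.1):
    """InfoNCE-style loss for a single attention head (illustrative only).

    query_vec:    (d,)   pooled query-projection vector for the question
    passage_keys: (n, d) pooled key-projection vectors, one per passage
    golden_idx:   index of the relevant ("golden") passage
    """
    # Cosine similarity between the question and every passage.
    q = query_vec / np.linalg.norm(query_vec)
    k = passage_keys / np.linalg.norm(passage_keys, axis=1, keepdims=True)
    logits = (k @ q) / temperature
    # Numerically stable log-softmax over passages.
    logits = logits - logits.max()
    log_probs = logits - np.log(np.exp(logits).sum())
    # Negative log-likelihood of the golden passage.
    return -log_probs[golden_idx]

# Toy data: four random "passages" and a query that nearly matches passage 2.
rng = np.random.default_rng(0)
keys = rng.normal(size=(4, 8))
aligned_query = keys[2] + 0.01 * rng.normal(size=8)

loss_aligned = info_nce_head_loss(aligned_query, keys, golden_idx=2)
loss_misaligned = info_nce_head_loss(aligned_query, keys, golden_idx=0)
```

In this toy setup the loss is lower when the passage labeled as golden is the one the query actually aligns with, which is the pressure a contrastive objective of this kind would put on a head's query and key projections.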
