Focus Directions Make Your Language Models Pay More Attention to Relevant Contexts

Youxiang Zhu, Ruochen Li, Danqing Wang, Daniel Haehn, Xiaohui Liang

TL;DR

The paper addresses the problem of LLMs being distracted by irrelevant information in long contexts. Using a contextual scoring method, it identifies contextual heads that regulate the model's overall attention, and shows that increasing their attention to relevant contexts boosts downstream performance. It then introduces focus directions, located in the key and query activations of these heads, which bias attention toward relevant contexts without external labels. Focus directions are learned by maximizing attention to relevant spans and can be applied at inference with an intervention factor $\alpha$ using a split-softmax reweighting scheme with exponent $\tau$. Across HELMET benchmarks and multiple model families, positive focus directions mitigate distraction and improve long-context task alignment, suggesting a practical approach for improving long-context LLM performance and alignment.
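To make the reweighting idea concrete, here is a minimal sketch of a split-softmax style intervention on a single attention row. The summary does not give the formula, so the form used here is an assumption: the total attention mass $p$ on the relevant span is raised to $p^{\tau}$ with $\tau \in (0, 1]$, and the remaining mass is rescaled so the row still sums to 1. The function names and the `relevant` index set are hypothetical and used only for illustration.

```python
import math


def softmax(scores):
    """Numerically stable softmax over a list of attention logits."""
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def split_softmax(scores, relevant, tau):
    """Reweight one attention row so the relevant span holds mass p**tau.

    scores:   raw attention logits for one query position
    relevant: set of key indices treated as relevant (hypothetical input)
    tau:      exponent in (0, 1]; tau = 1 leaves the attention unchanged
    NOTE: this formula is an assumed form, not taken from the paper.
    """
    attn = softmax(scores)
    p = sum(attn[i] for i in relevant)
    if p <= 0.0 or p >= 1.0:
        # Nothing to redistribute: all or none of the mass is on the span.
        return attn
    target = p ** tau                        # boosted mass on the relevant span
    scale_in = target / p                    # factor for relevant positions
    scale_out = (1.0 - target) / (1.0 - p)   # factor for all other positions
    return [a * (scale_in if i in relevant else scale_out)
            for i, a in enumerate(attn)]
```

Under this assumed form, smaller values of `tau` move more attention onto the relevant span, while the relative weights inside each group are preserved.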

Abstract

Long-context large language models (LLMs) are prone to distraction by irrelevant contexts. The reason for this distraction remains poorly understood. In this paper, we first identify the contextual heads, a special group of attention heads that control the overall attention of the LLM. We then demonstrate that distraction arises when contextual heads fail to allocate sufficient attention to relevant contexts, and that it can be mitigated by increasing attention to these contexts. We further identify focus directions, located at the key and query activations of these heads, which enable them to allocate more attention to relevant contexts without explicitly specifying which context is relevant. We comprehensively evaluate the effect of focus directions on various long-context tasks and find that they can help mitigate the poor task alignment of long-context LLMs. We believe our findings could promote further research on long-context LLM alignment.
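The following toy sketch illustrates how shifting an activation along a direction can change where attention goes. It is a simplification under stated assumptions, not the authors' procedure: it steers only the query vector (the abstract places focus directions at both key and query activations), the `direction` vector is hand-picked rather than learned, and `alpha` is a stand-in for an intervention factor. All names are hypothetical.

```python
import math


def dot(u, v):
    """Dot product of two equal-length vectors."""
    return sum(a * b for a, b in zip(u, v))


def attention(query, keys):
    """Scaled dot-product attention weights of one query over a list of keys."""
    scale = math.sqrt(len(query))
    logits = [dot(query, k) / scale for k in keys]
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def steer(vec, direction, alpha):
    """Shift an activation along a direction, scaled by alpha (assumed form)."""
    return [v + alpha * d for v, d in zip(vec, direction)]


# Toy data: key 1 is aligned with the hand-picked direction.
query = [1.0, 0.0]
keys = [[1.0, 0.0], [0.0, 1.0], [1.0, -1.0]]
direction = [0.0, 1.0]  # hypothetical direction, not a learned one

before = attention(query, keys)
boosted = attention(steer(query, direction, 2.0), keys)      # positive alpha
suppressed = attention(steer(query, direction, -2.0), keys)  # negative alpha
```

In this toy setup, a positive `alpha` raises the attention weight on the key aligned with `direction`, and a negative `alpha` lowers it, without naming that key explicitly in the intervention.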

Paper Structure

This paper contains 16 sections, 5 equations, 9 figures, 13 tables.

Figures (9)

  • Figure 1: Overview of this work. We first introduce contextual scoring, which measures the attention distribution over inputs during response generation. Based on contextual scoring, we identify the contextual heads, which control the overall attention of LLMs. We further identify focus directions, which make LLMs pay more attention to the relevant contexts.
  • Figure 2: Location of the contextual heads.
  • Figure 3: Performance across different top-$k$ contextual/random heads and split softmax exponents $\tau$. Baseline: 20 documents (1 relevant, 19 irrelevant) case without intervention. Gold baseline: 1 relevant document case without intervention. Negative baseline: 19 irrelevant documents case without intervention.
  • Figure 4: EM accuracy of different top-$k$ heads and $\alpha$.
  • Figure 5: Location of the contextual heads of Qwen2.5-7B-Instruct.
  • ...and 4 more figures