Mitigate Position Bias in Large Language Models via Scaling a Single Dimension

Yijiong Yu; Huiqiang Jiang; Xufang Luo; Qianhui Wu; Chin-Yew Lin; Dongsheng Li; Yuqing Yang; Yongfeng Huang; Lili Qiu

TL;DR

This work investigates the persistent 'lost in the middle' position bias in long-context LLMs and identifies positional information embedded in hidden states, shaped by the causal attention mask, as a key factor beyond position embeddings. It introduces a practical mitigation: identify a positional hidden-state channel $h_t$ and scale it by a factor $s<1$, primarily affecting the last-token attention, to rebalance attention across the prompt. The authors propose a monotonicity- and smoothness-based channel-search algorithm and validate their approach across a broad set of open-source models and long-context tasks, achieving gains of up to $15.2\%$ on NaturalQuestions and KV retrieval, with modest or no degradation on other capabilities. The results suggest a generalizable, low-overhead strategy for mitigating position bias that can complement existing RoPE- or SFT-based methods, with broad implications for robust long-context reasoning in LLMs.
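The idea of a monotonicity- and smoothness-based channel search can be illustrated with a minimal sketch. It is not the authors' algorithm: the scoring functions, the `min_monotonic` threshold, and the input layout (`hidden[t][c]`, one value per token position and channel) are all assumptions made for illustration.

```python
def monotonic_fraction(values):
    """Fraction of consecutive steps that move in the dominant direction."""
    diffs = [b - a for a, b in zip(values, values[1:])]
    ups = sum(1 for d in diffs if d > 0)
    downs = sum(1 for d in diffs if d < 0)
    return max(ups, downs) / len(diffs)


def roughness(values):
    """Mean absolute second difference, normalized by the value range."""
    second = [c - 2 * b + a for a, b, c in zip(values, values[1:], values[2:])]
    span = max(values) - min(values)
    if span == 0:
        return float("inf")  # a constant channel carries no positional signal
    return sum(abs(s) for s in second) / (len(second) * span)


def find_positional_channel(hidden, min_monotonic=0.9):
    """hidden[t][c] is the value of channel c at token position t.

    Hypothetical criterion: among channels that are (nearly) monotonic in
    position, return the index of the smoothest one, or None if no channel
    passes the monotonicity threshold. Needs at least 3 positions.
    """
    num_channels = len(hidden[0])
    best, best_rough = None, float("inf")
    for c in range(num_channels):
        series = [row[c] for row in hidden]
        if monotonic_fraction(series) < min_monotonic:
            continue
        r = roughness(series)
        if r < best_rough:
            best, best_rough = c, r
    return best


# Toy input: channel 0 oscillates, channel 1 decreases linearly with
# position, channel 2 is constant.
toy_hidden = [[(-1.0) ** t, -0.5 * t, 3.0] for t in range(8)]
positional_channel = find_positional_channel(toy_hidden)
```

In this toy input only the linearly decreasing channel is both monotonic and smooth, so it is the one selected. In practice the per-position values would come from a model's hidden states rather than a hand-built list.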

Abstract

Large Language Models (LLMs) are increasingly applied in various real-world scenarios due to their excellent generalization capabilities and robust generative abilities. However, they exhibit position bias, also known as "lost in the middle", a phenomenon that is especially pronounced in long-context scenarios: where the key information is placed in a prompt can significantly affect accuracy. This paper first explores the micro-level manifestations of position bias, concluding that attention weights are a micro-level expression of position bias. It further identifies that, in addition to position embeddings, the causal attention mask also contributes to position bias by creating position-specific hidden states. Based on these insights, we propose a method to mitigate position bias by scaling these positional hidden states. Experiments on the NaturalQuestions multi-document QA, KV retrieval, LongBench, and timeline reorder tasks, using various models including RoPE models, context-window-extended models, and ALiBi models, demonstrate the effectiveness and generalizability of our approach. Our method can improve performance by up to 15.2% by modifying just one dimension of hidden states. Our code is available at https://aka.ms/PositionalHidden.
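As a rough illustration of what "modifying just one dimension of hidden states" means, the sketch below multiplies a single channel by a factor `s`. The function name, the list-of-lists layout (`hidden[t][c]`), and the `positions` parameter are assumptions for illustration; the released code at the URL above is the authoritative implementation.

```python
def scale_channel(hidden, channel, s, positions=None):
    """Return a copy of hidden with one channel multiplied by s.

    hidden[t][c] is channel c at token position t. positions selects which
    token positions are scaled (an assumption of this sketch); None means
    every position is scaled.
    """
    if positions is None:
        positions = range(len(hidden))
    targets = set(positions)
    scaled = [list(row) for row in hidden]  # copy so the input is untouched
    for t in targets:
        scaled[t][channel] *= s
    return scaled


# Toy example: two tokens, two channels; shrink channel 1 of the last token.
toy_hidden = [[1.0, 2.0], [3.0, 4.0]]
toy_scaled = scale_channel(toy_hidden, channel=1, s=0.5, positions=[1])
```

Choosing `s < 1` shrinks the selected channel; all other channels, and all positions outside `positions`, are left unchanged.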

Paper Structure (36 sections, 4 equations, 15 figures, 6 tables, 1 algorithm)

Introduction
Positional Information in hidden states affects position bias
Microscopic Manifestations of Position Bias in Transformers: Attention Weight Patterns
Positional Information in Hidden States Also Contributes to Position Bias
Position Information can be clearly manifested in Specific Hidden states Channels
Mitigating Position Bias
Problem Formulation
Identifying Positional Hidden States
Scaling Positional Hidden States
Experiments
Setup
Evaluation Tasks and Models
Implementation Details
Baselines
Main Results
...and 21 more sections

Figures (15)

Figure 1: The relationship between causal mask, positional information in hidden states, positional hidden states, position embedding, attention pattern and position bias of model's performance.
Figure 2: Attention distribution of the gold KV pair to each KV pair across different positions on the KV retrieval task (Liu et al., 2023) using Mistral-7B (Jiang et al., 2023). (a) and (b) show the results averaged across all heads of the layer. (c) shows the attention of the ground-truth KV to the ground-truth KV (i.e., diagonal lines from (b)) across different context lengths.
Figure 3: Performance and attention of different methods with the ground-truth KV at different positions in the KV retrieval task (Liu et al., 2023) using Mistral-7B (Jiang et al., 2023).
Figure 4: Hidden-state values of the positional channel as a function of token position, averaged across all layers.
Figure 5: The framework of scaling positional hidden states and modifying attention.
...and 10 more figures

Theorems & Definitions (1)

Definition 2.1: Positional Hidden States
