Self-Consistency Falls Short! The Adverse Effects of Positional Bias on Long-Context Problems

Adam Byerly, Daniel Khashabi

TL;DR

The paper interrogates whether self-consistency (SC) scales to long-context problems, showing that position bias induces correlated errors that SC amplifies rather than mitigates. Through 651 experiments across eight models and nine long-context tasks, SC consistently degrades performance or yields negligible gains, across multiple aggregation methods and prompt formats. The authors identify position bias as the mechanistic culprit, evidencing U-shaped and monotonic error patterns in QA and text retrieval, and demonstrate that increasing model size only modestly reduces degradation without achieving improvements. The study highlights the need for bias-aware aggregation and context-aware inference strategies, suggesting directions such as position-aware voting, debiased sampling, and retrieval-augmented approaches for robust long-context understanding.

Abstract

Self-consistency (SC) improves the performance of large language models (LLMs) across various tasks and domains that involve short content. However, does this support its effectiveness for long-context problems? We challenge the assumption that SC's benefits generalize to long-context settings, where LLMs often struggle with position bias, the systematic over-reliance on specific context regions, which hinders their ability to utilize information effectively from all parts of their context. Through comprehensive experimentation with varying state-of-the-art models, tasks, and SC formulations, we find that SC not only fails to improve but actively degrades performance on long-context tasks. This degradation is driven by persistent position bias, which worsens with longer context lengths and smaller model sizes but remains invariant to prompt format or task type. Unlike short-context tasks, where SC diversifies reasoning paths, long-context SC amplifies positional errors. These comprehensive results provide valuable insight into the limitations of current LLMs in long-context understanding and highlight the need for more sophisticated approaches.
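
The contrast between independent and shared errors can be illustrated with a toy simulation. This is a minimal sketch under stated assumptions, not the paper's experimental setup: the function names (majority_vote, simulate), the two-answer world ("gold" vs. "distractor"), the 0.6 per-sample accuracy, and the use of nine samples (odd, to avoid ties) are all hypothetical choices made for illustration. The correlated case is deliberately extreme: one shared draw decides whether every sample is right or wrong.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer; ties go to the answer seen first."""
    return Counter(answers).most_common(1)[0][0]


def simulate(p_correct, n_samples, n_trials, correlated, seed=0):
    """Estimate majority-vote accuracy in a toy two-answer world.

    correlated=False: each sample is correct independently with
    probability p_correct (the setting where voting is expected to help).
    correlated=True: a single shared draw makes all samples correct or
    all wrong, an extreme stand-in for a bias shared by every sample.
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_trials):
        if correlated:
            ok = rng.random() < p_correct
            answers = ["gold" if ok else "distractor"] * n_samples
        else:
            answers = [
                "gold" if rng.random() < p_correct else "distractor"
                for _ in range(n_samples)
            ]
        hits += majority_vote(answers) == "gold"
    return hits / n_trials


if __name__ == "__main__":
    independent = simulate(0.6, n_samples=9, n_trials=5000, correlated=False)
    shared = simulate(0.6, n_samples=9, n_trials=5000, correlated=True)
    print(f"independent errors: {independent:.3f}")
    print(f"shared errors:      {shared:.3f}")
```

In this toy, voting lifts accuracy above the per-sample rate only when errors are independent; with fully shared errors the vote simply reproduces the single-sample accuracy. The sketch shows only that the gain disappears under correlation; it does not reproduce the degradation the paper reports.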

Paper Structure

This paper contains 46 sections, 1 equation, 7 figures, 3 tables.

Figures (7)

  • Figure 1: Schematic of self-consistency in long-context, "needle-in-a-haystack" scenarios. Input consists of a query and multiple evidence documents, one of which contains the correct answer, with a model generating diverse intermediate answers via stochastic sampling (non-zero temperature). However, aggregation yields incorrect answers due to position bias, highlighting a key challenge in long-context reasoning. Sampling from a model with inherent position bias amplifies rather than mitigates errors, as all samples inherit the same structural biases, violating SC's core assumption of error independence.
  • Figure 2: Average performance difference distribution for Qwen (left) and LLaMA (right) models across all tasks (excluding NQ-Open) in the main results table (tab:main). The y-axis shows the difference in performance between SC and baseline approaches (negative values indicate degradation). Box plots show quartiles with whiskers extending to min/max values. Both model families demonstrate reduced performance degradation as model size increases, but even the largest models still fail to break even.
  • Figure 3: Self-consistency accuracy across models for NQ-Open, showing positional bias. (a) QA accuracy is highest at the beginning or end of context, with LLaMA-3.1 degrading under SC across positions; a U-shaped pattern persists across context lengths and model sizes. The accuracy difference reveals an upward trend as the gold document position moves later in context: SC does relatively less harm at later positions but never reaches the performance of baseline models on early positions. (b) TR accuracy also peaks at the start, with severe drops as context length and position of relevant information increase. The corresponding difference plots demonstrate consistent negative impact across positions, with particularly severe degradation (20-25%) for early positions in longer contexts.
  • Figure 4: Distribution of correctness across eight intermediate generations, demonstrating positional bias before self-consistency aggregation. The stacked bars show the percentage of cases where none (0), some (1-3), most (4-7), or all (8) of the intermediate samples were correct. (Top Row) For Question Answering (QA), the prevalence of highly correct sample sets (green bars) follows a U-shaped curve, degrading when information is in the middle. (Bottom Row) For Text Retrieval (TR), correctness declines monotonically, with a dramatic increase in cases where zero samples are correct as the gold document is placed later in longer contexts. This visualization confirms that the errors amplified by SC are systemic and correlated, originating from the model's fundamental bias rather than the aggregation process.
  • Figure 5: Effect of prompt format on QA and TR self-consistency accuracy for the Qwen-2.5-7B model. Different prompt formats show minimal impact on mitigating positional bias. While Q-Doc-Q slightly improves overall QA performance (top row), TR performance (bottom row) is more sensitive to format choice, with up to 20% performance degradation between formats. The consistent degradation pattern across gold positions indicates that position bias persists regardless of query-document ordering.
  • ...and 2 more figures
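
The bucketing described in the Figure 4 caption (none, some, most, or all of eight intermediate samples correct) can be sketched as follows. This is an illustrative sketch, not the authors' analysis code: the function names (bucket, bucket_percentages) and the example counts are hypothetical, and the "some"/"most" boundary is assumed to sit at half the samples so that it matches the 1-3 and 4-7 ranges given in the caption.

```python
from collections import Counter

BUCKETS = ("none", "some", "most", "all")


def bucket(num_correct, n_samples=8):
    """Map a count of correct samples to the caption's four categories.

    For n_samples=8: 0 -> none, 1-3 -> some, 4-7 -> most, 8 -> all.
    """
    if not 0 <= num_correct <= n_samples:
        raise ValueError("num_correct must lie in [0, n_samples]")
    if num_correct == 0:
        return "none"
    if num_correct == n_samples:
        return "all"
    # Assumed boundary: fewer than half correct counts as "some".
    return "some" if num_correct < n_samples / 2 else "most"


def bucket_percentages(correct_counts, n_samples=8):
    """Return the percentage of questions falling in each bucket."""
    if not correct_counts:
        raise ValueError("correct_counts must not be empty")
    tally = Counter(bucket(c, n_samples) for c in correct_counts)
    total = len(correct_counts)
    return {name: 100.0 * tally[name] / total for name in BUCKETS}


if __name__ == "__main__":
    # Hypothetical per-question counts of correct samples at one gold position.
    counts_at_position = [0, 8, 2, 5]
    print(bucket_percentages(counts_at_position))
```

Computing these percentages once per gold-document position would give one stacked bar per position, which is the shape of plot the caption describes.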