Self-Consistency Falls Short! The Adverse Effects of Positional Bias on Long-Context Problems
Adam Byerly, Daniel Khashabi
TL;DR
The paper interrogates whether self-consistency (SC) scales to long-context problems, showing that position bias induces correlated errors that SC amplifies rather than mitigates. Through 651 experiments across eight models and nine long-context tasks, SC consistently degrades performance or yields negligible gains, across multiple aggregation methods and prompt formats. The authors identify position bias as the mechanistic culprit, evidencing U-shaped and monotonic error patterns in QA and text retrieval, and demonstrate that increasing model size only modestly reduces degradation without achieving improvements. The study highlights the need for bias-aware aggregation and context-aware inference strategies, suggesting directions such as position-aware voting, debiased sampling, and retrieval-augmented approaches for robust long-context understanding.
Abstract
Self-consistency (SC) improves the performance of large language models (LLMs) across various tasks and domains that involve short content. However, does this support its effectiveness for long-context problems? We challenge the assumption that SC's benefits generalize to long-context settings, where LLMs often struggle with position bias, the systematic over-reliance on specific context regions, which hinders their ability to utilize information effectively from all parts of their context. Through comprehensive experimentation with varying state-of-the-art models, tasks, and SC formulations, we find that SC not only fails to improve but actively degrades performance on long-context tasks. This degradation is driven by persistent position bias, which worsens with longer context lengths and smaller model sizes but remains invariant to prompt format or task type. Unlike short-context tasks, where SC diversifies reasoning paths, long-context SC amplifies positional errors. These comprehensive results provide valuable insight into the limitations of current LLMs in long-context understanding and highlight the need for more sophisticated approaches.
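The claim that voting can amplify shared errors can be illustrated with a toy simulation. The sketch below is not the authors' experimental setup: it assumes plain majority voting as the SC aggregator, and the names `voted_accuracy`, `wrong_pool`, and the chosen probabilities are hypothetical. Both regimes give each sample the same 40% chance of being correct; they differ only in whether wrong answers scatter across many options or collapse onto a single distractor, which stands in for a position-induced error.

```python
import random
from collections import Counter


def majority_vote(answers):
    """Return the most frequent answer (ties go to the answer seen first)."""
    return Counter(answers).most_common(1)[0][0]


def voted_accuracy(p_correct, wrong_pool, n_samples=15, n_questions=2000, seed=0):
    """Estimate accuracy after majority voting over n_samples answers.

    Each sample is "gold" with probability p_correct; otherwise it is drawn
    uniformly from wrong_pool. A one-element pool models errors that all
    samples share (an assumed stand-in for position bias).
    """
    rng = random.Random(seed)
    hits = 0
    for _ in range(n_questions):
        answers = [
            "gold" if rng.random() < p_correct else rng.choice(wrong_pool)
            for _ in range(n_samples)
        ]
        hits += majority_vote(answers) == "gold"
    return hits / n_questions


P_CORRECT = 0.4  # per-sample accuracy, identical in both regimes

# Wrong answers spread over ten alternatives: voting can recover the gold answer.
scattered = voted_accuracy(P_CORRECT, [f"wrong-{i}" for i in range(10)])

# Wrong answers all land on one distractor: voting locks in the shared error.
shared = voted_accuracy(P_CORRECT, ["distractor"])

print(f"single sample: {P_CORRECT:.2f}")
print(f"voted, scattered errors: {scattered:.2f}")
print(f"voted, shared errors: {shared:.2f}")
```

Under these assumptions, voting lifts accuracy above the single-sample rate when errors are scattered and pushes it below that rate when errors are shared, which is the qualitative pattern the abstract attributes to position bias in long-context settings.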
