Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA
George Arthur Baker, Ankush Raut, Sagi Shaier, Lawrence E Hunter, Katharina von der Wense
TL;DR
The paper investigates the 'Lost in the Middle' bias in long-context language models within multi-hop QA, where reasoning spans multiple documents. It systematically analyzes how the absolute and relative positions of evidence affect performance across GPT-3.5-turbo, MPT-7b-instruct, and Llama-2-7b-longlora on HotpotQA, WikiMultihopQA, and MuSiQue-Ans, under full, summarized, and knowledge-graph-reduced contexts. It evaluates mitigation strategies including chain-of-thought prompting and context-reduction techniques, finding that evidence adjacency and relative distance matter, CoT helps some models but not all, and context reduction reduces middle bias at the cost of overall accuracy. The results highlight the complexity of dispersed evidence in multi-hop reasoning and point to directions such as improved preprocessing, prompting, and memory mechanisms to enhance robustness of long-context QA systems.
Abstract
Previous work finds that recent long-context language models fail to make equal use of information in the middle of their inputs, preferring pieces of information located at either end. This creates an undue bias in situations where we would like models to be equally capable of using different parts of the input. Thus far, the problem has mainly been considered in settings with a single piece of critical information, leading us to question what happens when multiple necessary pieces of information are spread out over the input. Here, we demonstrate the effects of the "lost in the middle" problem in the multi-hop question answering setting, in which multiple reasoning "hops" over disconnected documents are required, and show that performance degrades not only with respect to the distance of information from the edges of the context, but also with respect to the distance between pieces of information. Additionally, we experiment with means of alleviating the problem by reducing superfluous document contents through knowledge graph triple extraction and summarization, and by prompting models to reason more thoroughly using chain-of-thought prompting.
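To make the two position effects concrete, the following is a minimal Python sketch of how evidence documents might be placed among distractor documents, together with the two distances the abstract refers to. All names here (place_evidence, edge_distance, evidence_gap) are hypothetical illustrations and are not taken from the paper, whose actual experimental code may differ.

```python
from typing import List, Sequence


def place_evidence(
    distractors: Sequence[str],
    evidence: Sequence[str],
    positions: Sequence[int],
) -> List[str]:
    """Return a document list in which evidence[i] sits at index positions[i].

    Hypothetical helper: distractors fill the remaining slots in their
    original order.
    """
    total = len(distractors) + len(evidence)
    if len(evidence) != len(positions):
        raise ValueError("need exactly one position per evidence document")
    if len(set(positions)) != len(positions):
        raise ValueError("positions must be distinct")
    if any(not 0 <= p < total for p in positions):
        raise ValueError("position out of range")
    slots = dict(zip(positions, evidence))
    filler = iter(distractors)
    return [slots[i] if i in slots else next(filler) for i in range(total)]


def edge_distance(position: int, total: int) -> int:
    """Distance (in documents) from the nearest edge of the context."""
    return min(position, total - 1 - position)


def evidence_gap(positions: Sequence[int]) -> int:
    """Largest distance (in documents) between neighbouring evidence documents."""
    ordered = sorted(positions)
    return max((b - a for a, b in zip(ordered, ordered[1:])), default=0)


# Two evidence documents at opposite edges of a five-document context.
docs = place_evidence(["d0", "d1", "d2"], ["e0", "e1"], [0, 4])
```

In this sketch, varying positions while holding the documents fixed separates the two factors: edge_distance captures how deep in the context a piece of evidence sits, and evidence_gap captures how far apart the pieces are from each other.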
