Lost in the Middle, and In-Between: Enhancing Language Models' Ability to Reason Over Long Contexts in Multi-Hop QA

George Arthur Baker, Ankush Raut, Sagi Shaier, Lawrence E Hunter, Katharina von der Wense

TL;DR

The paper investigates the 'Lost in the Middle' bias in long-context language models within multi-hop QA, where reasoning spans multiple documents. It systematically analyzes how the absolute and relative positions of evidence affect performance across GPT-3.5-turbo, MPT-7b-instruct, and Llama-2-7b-longlora on HotpotQA, WikiMultihopQA, and MuSiQue-Ans, under full, summarized, and knowledge-graph-reduced contexts. It evaluates mitigation strategies including chain-of-thought prompting and context-reduction techniques, finding that evidence adjacency and relative distance matter, CoT helps some models but not all, and context reduction reduces middle bias at the cost of overall accuracy. The results highlight the complexity of dispersed evidence in multi-hop reasoning and point to directions such as improved preprocessing, prompting, and memory mechanisms to enhance robustness of long-context QA systems.

Abstract

Previous work finds that recent long-context language models fail to make equal use of information in the middle of their inputs, preferring pieces of information located at the tail ends, which creates an undue bias in situations where we would like models to be equally capable of using different parts of the input. Thus far, the problem has mainly been considered in settings with single pieces of critical information, leading us to question what happens when multiple necessary pieces of information are spread out over the inputs. Here, we demonstrate the effects of the "lost in the middle" problem in the multi-hop question answering setting -- in which multiple reasoning "hops" over disconnected documents are required -- and show that performance degrades not only with respect to the distance of information from the edges of the context, but also with respect to the distance between pieces of information. Additionally, we experiment with means of alleviating the problem by reducing superfluous document contents through knowledge graph triple extraction and summarization, and by prompting models to reason more thoroughly using chain-of-thought prompting.

Paper Structure

This paper contains 17 sections, 4 figures, 5 tables.

Figures (4)

  • Figure 1: Question Answering setups with documents containing relevant information (green) and distractor documents (gray) placed at different ordinal positions, with both Single-hop (a) and Multi-hop (b) questions.
  • Figure 2: The performance impacts of varying the positions of relevant documents within instruction-tuned models' inputs, with context reduction techniques and Chain-of-Thought prompting. All positions are out of 20 total documents. KG + CoT results for gpt-3.5-turbo are moved to the full-results appendix to highlight other results.
  • Figure 3: Experimental results for Llama-2-7b-longlora-8k-ft. Results for MuSiQue 3- and 4-hop splits are relegated to the full-results appendix due to exceedingly poor performance.
  • Figure 4: Average question-answering accuracy for full document prompts by distance setting for GPT and MPT models. Performance with adjacent evidence documents is generally higher than when evidence documents are separated by distractor documents.
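
The positional setup described in the figure captions (evidence documents placed at chosen ordinal positions among distractors, 20 documents in total) can be illustrated with a minimal sketch. This is not the authors' code: the function names `build_context` and `max_gap`, the placeholder document strings, and the choice to measure separation as the number of intervening distractors are all assumptions made for illustration.

```python
from typing import List, Sequence


def build_context(evidence: Sequence[str],
                  distractors: Sequence[str],
                  positions: Sequence[int]) -> List[str]:
    """Place each evidence document at its 1-indexed slot; fill the rest with distractors.

    Hypothetical helper for illustration only.
    """
    total = len(evidence) + len(distractors)
    if len(positions) != len(evidence):
        raise ValueError("need exactly one position per evidence document")
    if len(set(positions)) != len(positions):
        raise ValueError("positions must be distinct")
    if any(p < 1 or p > total for p in positions):
        raise ValueError(f"positions must lie in 1..{total}")

    slots = dict(zip(positions, evidence))
    remaining = iter(distractors)  # distractors keep their original order
    return [slots[i] if i in slots else next(remaining)
            for i in range(1, total + 1)]


def max_gap(positions: Sequence[int]) -> int:
    """Largest number of distractors between consecutive evidence documents.

    A value of 0 means all evidence documents are adjacent (assumed definition).
    """
    ordered = sorted(positions)
    return max((b - a - 1 for a, b in zip(ordered, ordered[1:])), default=0)


# Placeholder documents: 2 evidence + 18 distractors = 20 documents in total.
evidence = ["EVIDENCE-1", "EVIDENCE-2"]
distractors = [f"distractor-{i}" for i in range(1, 19)]

adjacent = build_context(evidence, distractors, positions=[10, 11])  # mid-context, adjacent
spread = build_context(evidence, distractors, positions=[1, 20])     # at both edges
```

Here `adjacent` corresponds to an adjacent-evidence setting in the middle of the context, while `spread` puts the two evidence documents at opposite ends with every distractor between them. Joining the returned list into a prompt is left out because the prompt template is not given in this section.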