Mitigating Lost-in-Retrieval Problems in Retrieval Augmented Multi-Hop Question Answering

Rongzhi Zhu, Xiangyu Liu, Zequn Sun, Yiwei Wang, Wei Hu

TL;DR

This work addresses lost-in-retrieval in retrieval-augmented multi-hop QA, where sub-questions often omit key entities, hindering retrieval and reasoning. It introduces ChainRAG, a progressive framework that builds a sentence graph with entity indexing, performs seed retrieval and iterative expansion, rewrites sub-questions to fill missing entities, and integrates answers and contexts to produce a final response. Empirical results on MuSiQue, 2Wiki, and HotpotQA across GPT4o-mini, Qwen2.5-72B, and GLM-4-Plus show substantial gains over baselines in both accuracy (F1/EM) and efficiency, with robust performance across models. The approach demonstrates practical potential for robust, scalable multi-hop QA by leveraging structured sentence-level retrieval and targeted entity completion without heavy knowledge-graph construction.

Abstract

In this paper, we identify a critical problem, "lost-in-retrieval", in retrieval-augmented multi-hop question answering (QA): key entities are missing from the sub-questions that LLMs produce during question decomposition. "Lost-in-retrieval" significantly degrades retrieval performance, which disrupts the reasoning chain and leads to incorrect answers. To resolve this problem, we propose a progressive retrieval and rewriting method, namely ChainRAG, which handles each sub-question in sequence by completing its missing key entities and retrieving relevant sentences from a sentence graph for answer generation. Each step in our retrieval and rewriting process builds upon the previous one, creating a seamless chain that leads to accurate retrieval and answers. Finally, all retrieved sentences and sub-question answers are integrated to generate a comprehensive answer to the original question. We evaluate ChainRAG on three multi-hop QA datasets (MuSiQue, 2Wiki, and HotpotQA) using three large language models: GPT4o-mini, Qwen2.5-72B, and GLM-4-Plus. Empirical results demonstrate that ChainRAG consistently outperforms baselines in both effectiveness and efficiency.
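
To make the sentence-graph idea concrete, here is a minimal sketch, not the authors' implementation. It assumes entities are already known and matched by plain substring search (a real system would use a named-entity recognizer), and the function names `build_sentence_graph` and `expand`, as well as the toy sentences, are invented for illustration. Sentences become nodes, two sentences are linked when they share an entity, and retrieval grows outward from a set of seed sentences.

```python
from collections import defaultdict
from itertools import combinations


def build_sentence_graph(sentences, entities):
    """Return (entity index, edges) for a list of sentences.

    Assumption: entity mentions are found by case-insensitive substring
    match, which stands in for a proper entity recognizer.
    """
    # Entity index: entity -> ids of the sentences that mention it.
    index = defaultdict(set)
    for i, sentence in enumerate(sentences):
        for entity in entities:
            if entity.lower() in sentence.lower():
                index[entity].add(i)

    # Edges: (smaller id, larger id) -> set of entities both sentences share.
    edges = defaultdict(set)
    for entity, ids in index.items():
        for a, b in combinations(sorted(ids), 2):
            edges[(a, b)].add(entity)
    return index, edges


def expand(seeds, edges, hops=1):
    """Grow a seed set by following shared-entity edges for `hops` rounds."""
    selected = set(seeds)
    for _ in range(hops):
        frontier = set()
        for a, b in edges:
            if a in selected:
                frontier.add(b)
            if b in selected:
                frontier.add(a)
        selected |= frontier
    return sorted(selected)


# Toy corpus with invented names.
sentences = [
    "Ada Vale directed Northwind.",
    "Northwind premiered at the Harbor Festival.",
    "The Harbor Festival is held in Port Ellis.",
    "Lake Orin is a reservoir.",
]
entities = ["Ada Vale", "Northwind", "Harbor Festival", "Lake Orin"]
index, edges = build_sentence_graph(sentences, entities)
one_hop = expand([0], edges, hops=1)
two_hops = expand([0], edges, hops=2)
```

Starting from the sentence about the director, one round of expansion reaches the premiere sentence through the shared film title, and a second round reaches the festival location. This is the kind of chain a multi-hop question needs, and it is also why a sub-question that drops the linking entity tends to retrieve unrelated text.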

Paper Structure

This paper contains 35 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: Example of the "lost in retrieval" issue where the second sub-question retrieves irrelevant text due to the unclear key entity, leading to an incorrect answer.
  • Figure 2: Analysis of "lost in retrieval". We evaluate the Recall@2 (%) of different sub-questions.
  • Figure 3: Framework overview of ChainRAG. It first constructs a sentence graph, where the edges between sentence nodes are labeled by their common named entities. Given a question, it is decomposed into sub-questions. Then, our iterative process involves retrieval, answering, and rewriting the unclear sub-question by filling in missing entities. Finally, it integrates all retrieved sentences and answers to produce a comprehensive answer.
  • Figure 4: F1 (%) comparison of ablation study.
  • Figure 5: EM (%) comparison of ablation study.
  • ...and 3 more figures
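
Figure 2 reports Recall@2 per sub-question. The summary above does not spell out the exact definition, so the following is a sketch of the common Recall@k formulation: the fraction of gold supporting items that appear among the top k retrieved items. The function name and the example ids are assumptions made for illustration.

```python
def recall_at_k(retrieved, relevant, k=2):
    """Fraction of relevant items found in the first k retrieved items.

    `retrieved` is a ranked list (best first); `relevant` is the gold set.
    """
    relevant = set(relevant)
    if not relevant:
        raise ValueError("relevant set must not be empty")
    hits = relevant & set(retrieved[:k])
    return len(hits) / len(relevant)


# One of the two gold sentences is in the top 2, so recall is 0.5.
score = recall_at_k(["s3", "s1", "s7"], {"s1", "s2"}, k=2)
```

A sub-question with a missing key entity would typically push its gold sentences out of the top k, which shows up as a lower score on this metric.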