GenSco: Can Question Decomposition based Passage Alignment improve Question Answering?

Barah Fazili, Koustava Goswami, Natwar Modani, Inderjeet Nair

TL;DR

GenSco introduces a two-LLM framework for multi-hop QA that uses question decomposition to guide passage sequence selection and final answer generation. A Generator LLM creates subquestions while a Scorer LLM evaluates candidate passages to build an aligned, ordered context, enabling more accurate and faithful answers with fewer distracting passages. Evaluations on 2WikiMultiHop, Adversarial HotPotQA, and MuSiQue show substantial exact-match gains, particularly a +15.1 EM improvement on MuSiQue and +5.9 EM on 2WikiMultiHop, along with improved retrieval precision and reduced hallucinations. The approach is inference-only and can augment existing retrieval systems, though its benefits hinge on the size of the candidate passage set and the availability of capable LLMs.

Abstract

Retrieval augmented generation (RAG) with large language models (LLMs) for Question Answering (QA) entails furnishing relevant context within the prompt to help the LLM generate an answer. During generation, inaccuracies or hallucinations frequently occur due to two primary factors: inadequate or distracting context in the prompts, and the inability of LLMs to effectively reason through the facts. In this paper, we investigate whether providing aligned context via a carefully selected passage sequence leads to better answer generation by the LLM for multi-hop QA. We introduce "GenSco", a novel approach to selecting passages based on the predicted decomposition of multi-hop questions. The framework consists of two distinct LLMs: (i) a Generator LLM, which is used for question decomposition and final answer generation; and (ii) an auxiliary open-source LLM, used as the Scorer, which semantically guides the Generator in passage selection. The Generator is invoked only once for answer generation, resulting in a cost-effective and efficient approach. We evaluate on three broadly established multi-hop question answering datasets, 2WikiMultiHop, Adversarial HotPotQA, and MuSiQue, and achieve absolute gains of 15.1 and 5.9 points in Exact Match score over the best performing baselines on MuSiQue and 2WikiMultiHop, respectively.
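
The passage selection loop described in the abstract can be sketched roughly as follows. This is a minimal illustration under stated assumptions, not the authors' implementation: the names `select_passages`, `gen_subquestion`, `score`, and `max_hops` are hypothetical, the stop signal (an empty subquestion) is an assumption, and the word-overlap scorer is only a toy stand-in for the Scorer LLM.

```python
from typing import Callable, List, Sequence


def select_passages(
    question: str,
    passages: Sequence[str],
    gen_subquestion: Callable[[str, List[str]], str],
    score: Callable[[str, str], float],
    max_hops: int = 4,  # assumed cap on the number of decomposition levels
) -> List[str]:
    """Greedily build an ordered passage sequence, one passage per subquestion."""
    remaining = list(passages)
    selected: List[str] = []
    for _ in range(max_hops):
        if not remaining:
            break
        # Stand-in for the Generator: next subquestion given what is selected so far.
        subq = gen_subquestion(question, selected)
        if not subq:  # assumed stop signal: no further subquestion
            break
        # Stand-in for the Scorer: keep the highest-scoring remaining passage.
        best = max(remaining, key=lambda p: score(subq, p))
        selected.append(best)
        remaining.remove(best)
    return selected


# Toy stand-ins (hypothetical), used only to make the sketch runnable.
def toy_subquestions(question: str, selected: List[str]) -> str:
    plan = ["Who directed Inception?", "Where was Christopher Nolan born?"]
    return plan[len(selected)] if len(selected) < len(plan) else ""


def toy_score(subq: str, passage: str) -> float:
    q_words = set(subq.lower().rstrip("?").split())
    p_words = set(passage.lower().rstrip(".").split())
    return float(len(q_words & p_words))  # word overlap, not an LLM score


candidates = [
    "Paris is the capital of France.",
    "Christopher Nolan was born in London.",
    "Inception was directed by Christopher Nolan.",
]
ordered_context = select_passages(
    "Where was the director of Inception born?",
    candidates,
    toy_subquestions,
    toy_score,
)
```

In this toy run, `ordered_context` holds the two relevant passages in hop order and leaves out the distractor. In a real pipeline, that ordered list would then be passed as context to the Generator for the final answer.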

Paper Structure

This paper contains 35 sections, 3 equations, 4 figures, 11 tables, 1 algorithm.

Figures (4)

  • Figure 1: GenSco: the subquestion at each level is generated using the subQ-Gen module, and the Scorer module is invoked to select the passage (greedy algorithm). The sequence of passages is then passed as context to the Generator G to generate the final answer (bottom).
  • Figure 2: Histogram of delta (number of supporting passages minus number of passages retrieved by GenSco-stop) for subsets of the 2WikiMultiHop dataset with 1, 2, and 4 supporting passages (left to right, top to bottom).
  • Figure 3: Performance across different sized subsets of the 2WikiMultiHop dataset.
  • Figure 4: Scatter plot of answers for 2WikiMultiHop
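
The delta statistic in the Figure 2 caption can be tallied as in the short sketch below. The function name `delta_histogram` and the per-question counts are made up for illustration and are not taken from the paper's data.

```python
from collections import Counter
from typing import Sequence


def delta_histogram(num_supporting: Sequence[int], num_retrieved: Sequence[int]) -> Counter:
    """Tally delta = supporting passages - retrieved passages, one value per question."""
    if len(num_supporting) != len(num_retrieved):
        raise ValueError("both sequences must have one entry per question")
    return Counter(s - r for s, r in zip(num_supporting, num_retrieved))


# Made-up counts for five questions that each have 2 supporting passages.
hist = delta_histogram([2, 2, 2, 2, 2], [2, 3, 1, 2, 2])
```

A delta of 0 means the number of retrieved passages matched the number of supporting passages; a negative delta means more passages were retrieved than needed, and a positive delta means fewer.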