Modeling Contextual Passage Utility for Multihop Question Answering

Akriti Jain, Aparna Garimella

TL;DR

This work tackles the challenge that passage utility in multihop QA is inherently context-dependent and influenced by the surrounding reasoning chain. It introduces a lightweight contextual utility predictor based on RoBERTa-large, trained via synthetic data generated from explicit reasoning traces and scored by GPT-4o to capture inter-passage dependencies. Empirical results across HotpotQA, MuSiQue, and 2WikiMultiHopQA show significant improvements in identifying useful passage sets, ranking them effectively, and boosting downstream QA performance compared to relevance-based baselines. The approach suggests that explicit modeling of context-aware utility can enhance retrieval-augmented QA and potentially generalize to other multi-step reasoning tasks.

Abstract

Multihop Question Answering (QA) requires systems to identify and synthesize information from multiple text passages. While most prior retrieval methods assist in identifying relevant passages for QA, further assessing the utility of the passages can help in removing redundant ones, which may otherwise add noise and introduce inaccuracies in the generated answers. Existing utility prediction approaches model passage utility independently, overlooking a critical aspect of multihop reasoning: the utility of a passage can be context-dependent, influenced by its relation to other passages (whether it provides complementary information or forms a crucial link in conjunction with others). In this paper, we propose a lightweight approach to model contextual passage utility, accounting for inter-passage dependencies. We fine-tune a small transformer-based model to predict passage utility scores for multihop QA. We leverage the reasoning traces from an advanced reasoning model to capture the order in which passages are used to answer a question and obtain synthetic training data. Through comprehensive experiments, we demonstrate that our utility-based scoring of retrieved passages leads to improved reranking and downstream QA performance compared to relevance-based reranking methods.
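
The difference between independent and contextual utility scoring can be illustrated with a minimal Python sketch. Everything here is an assumption made for illustration: the names `toy_scorer` and `rerank_by_contextual_utility` are hypothetical, and the scorer is a crude lexical-overlap stand-in, not the fine-tuned transformer the paper describes. The only point is that each candidate passage is scored together with the other retrieved passages instead of on its own.

```python
from typing import Callable, List, Sequence, Tuple

# Hypothetical scorer interface: (question, candidate, other passages) -> utility score.
Scorer = Callable[[str, str, Sequence[str]], float]


def toy_scorer(question: str, candidate: str, context: Sequence[str]) -> float:
    """Toy stand-in for a learned utility model (an assumption, not the paper's model).

    Counts candidate words shared with the question, plus candidate words
    shared with the other passages, so the score depends on the context.
    """
    q_words = set(question.lower().split())
    c_words = set(candidate.lower().split())
    ctx_words = {w for p in context for w in p.lower().split()}
    return float(len(c_words & q_words) + len(c_words & ctx_words))


def rerank_by_contextual_utility(
    question: str, passages: List[str], scorer: Scorer
) -> List[Tuple[str, float]]:
    """Score each passage conditioned on all the others, then sort descending."""
    scored = []
    for i, candidate in enumerate(passages):
        others = passages[:i] + passages[i + 1:]  # every passage except the candidate
        scored.append((candidate, scorer(question, candidate, others)))
    return sorted(scored, key=lambda item: item[1], reverse=True)


# Made-up example in the spirit of a two-hop question.
question = "where was the director of film x born"
p1 = "film x was directed by jane doe"
p2 = "jane doe was born in oslo"
p3 = "bananas are yellow"

independent_p2 = toy_scorer(question, p2, [])        # p2 scored in isolation
contextual_p2 = toy_scorer(question, p2, [p1, p3])   # p2 scored alongside p1 and p3
ranked = rerank_by_contextual_utility(question, [p3, p2, p1], toy_scorer)

for passage, score in ranked:
    print(f"{score:4.1f}  {passage}")
```

In this toy setup the second-hop passage scores higher once the bridging passage is part of its context. In practice a fine-tuned encoder would take the place of `toy_scorer`; how the paper formats the model input is not specified in this summary.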

Paper Structure

This paper contains 12 sections, 1 equation, 4 figures, 2 tables.

Figures (4)

  • Figure 1: A multihop question from the HotpotQA dataset (Yang et al., 2018). Considered independently, Passage 2 does not seem useful for answering the question. However, conditioned on Passage 1, it becomes useful.
  • Figure 2: Performance comparison of decoder-only models (LLaMA 3.2 1B and LLaMA 3.1 8B) on the HotpotQA dataset, fine-tuned using two different methods: pointwise and listwise scoring.
  • Figure 3: Performance comparison of decoder-only models (LLaMA 3.2 1B and LLaMA 3.1 8B) on the MuSiQue dataset, fine-tuned using two different methods: pointwise and listwise scoring.
  • Figure 4: Performance comparison of decoder-only models (LLaMA 3.2 1B and LLaMA 3.1 8B) on the 2WikiMultiHopQA dataset, fine-tuned using two different methods: pointwise and listwise scoring.