Modeling Contextual Passage Utility for Multihop Question Answering
Akriti Jain, Aparna Garimella
TL;DR
This work addresses the problem that passage utility in multihop QA is inherently context-dependent: it is shaped by the surrounding reasoning chain. It introduces a lightweight contextual utility predictor based on RoBERTa-large, trained on synthetic data generated from explicit reasoning traces and scored by GPT-4o to capture inter-passage dependencies. Empirical results across HotpotQA, MuSiQue, and 2WikiMultiHopQA show significant improvements over relevance-based baselines in identifying useful passage sets, ranking them, and downstream QA performance. The results suggest that explicitly modeling context-aware utility can enhance retrieval-augmented QA and may generalize to other multi-step reasoning tasks.
Abstract
Multihop Question Answering (QA) requires systems to identify and synthesize information from multiple text passages. Most prior retrieval methods help identify passages relevant to a question; further assessing the utility of those passages can help remove redundant ones, which may otherwise add noise and cause inaccuracies in the generated answers. Existing utility prediction approaches model passage utility independently, overlooking a critical aspect of multihop reasoning: the utility of a passage can be context-dependent, influenced by its relation to other passages, for example whether it provides complementary information or forms a crucial link in conjunction with others. In this paper, we propose a lightweight approach to model contextual passage utility, accounting for inter-passage dependencies. We fine-tune a small transformer-based model to predict passage utility scores for multihop QA. To obtain synthetic training data, we leverage the reasoning traces of an advanced reasoning model, which capture the order in which passages are used to answer a question. Through comprehensive experiments, we demonstrate that our utility-based scoring of retrieved passages leads to improved reranking and downstream QA performance compared to relevance-based reranking methods.
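The distinction between independent and context-dependent scoring can be illustrated with a small toy sketch. This is not the authors' model: the fine-tuned transformer is replaced here by a simple lexical-overlap stand-in, and the names content_terms, independent_score, contextual_score, and the stopword list are all hypothetical. The sketch only shows why a second-hop passage can look weak in isolation yet become useful once a first-hop passage is in context.

```python
import re

# Hypothetical stopword list, chosen only for this toy example.
STOPWORDS = {"the", "of", "was", "in", "by", "where", "a"}


def content_terms(text):
    """Lowercased alphanumeric tokens with stopwords removed."""
    return {t for t in re.findall(r"[a-z0-9]+", text.lower()) if t not in STOPWORDS}


def independent_score(question, passage):
    """Relevance-style score: fraction of question terms found in the passage."""
    q = content_terms(question)
    return len(q & content_terms(passage)) / len(q)


def contextual_score(question, passage, context):
    """Toy context-dependent score (an assumption, not the paper's predictor).

    Besides question terms, the passage is credited for matching "bridge"
    terms that the context passages introduce beyond the question itself.
    """
    q = content_terms(question)
    bridge = set()
    for other in context:
        bridge |= content_terms(other) - q
    p = content_terms(passage)
    return (len(p & q) + len(p & bridge)) / len(q)


question = "Where was the director of Inception born?"
first_hop = "Inception was directed by Christopher Nolan."
second_hop = "Christopher Nolan was born in London."
distractor = "The director of the festival was born in Rome."

# Scored in isolation, the distractor outranks the true second-hop passage.
print(independent_score(question, second_hop), independent_score(question, distractor))

# Scored with the first-hop passage as context, the second-hop passage wins.
print(contextual_score(question, second_hop, [first_hop]),
      contextual_score(question, distractor, [first_hop]))
```

In this toy setting the bridge entity introduced by the first-hop passage is what raises the second-hop passage above the distractor. A learned utility predictor would have to capture this kind of inter-passage dependency from data rather than from word overlap.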
