Quick and (not so) Dirty: Unsupervised Selection of Justification Sentences for Multi-hop Question Answering
Vikas Yadav, Steven Bethard, Mihai Surdeanu
TL;DR
The paper tackles the challenge of selecting justification sentences for multi-hop QA without relying on labeled justification data. It introduces ROCC, an unsupervised method that builds candidate justification sets from BM25-retrieved sentences and ranks them using a joint objective that favors relevance, minimizes overlap, and maximizes coverage of the question and answer; the best set is then used to condition a BERT-based QA classifier. Empirically, ROCC achieves state-of-the-art results on ARC and MultiRC among approaches that do not use external training data, with AutoROCC offering strong domain robustness and higher justification quality. The work demonstrates both improved interpretability and practical viability, and suggests ROCC as a potential distant supervision signal for supervised justification strategies.
Abstract
We propose an unsupervised strategy for the selection of justification sentences for multi-hop question answering (QA) that (a) maximizes the relevance of the selected sentences, (b) minimizes the overlap between the selected facts, and (c) maximizes the coverage of both question and answer. This unsupervised sentence selection method can be coupled with any supervised QA approach. We show that the sentences selected by our method improve the performance of a state-of-the-art supervised QA model on two multi-hop QA datasets: AI2's Reasoning Challenge (ARC) and Multi-Sentence Reading Comprehension (MultiRC). We obtain new state-of-the-art performance on both datasets among approaches that do not use external resources for training the QA system: 56.82% F1 on ARC (41.24% on Challenge and 64.49% on Easy) and 26.1% EM0 on MultiRC. Our justification sentences have higher quality than the justifications selected by a strong information retrieval baseline, e.g., by 5.4% F1 in MultiRC. We also show that our unsupervised selection of justification sentences is more stable across domains than a state-of-the-art supervised sentence selection method.
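The three criteria named in the abstract (relevance, overlap, coverage) can be illustrated with a small sketch. This is not the authors' implementation: the way the three terms are combined, the Jaccard-based overlap, the whitespace tokenizer, the `eps` constant, and all function names are assumptions made for illustration, and the relevance scores are supplied by hand where a retriever such as BM25 would normally provide them.

```python
from itertools import combinations


def tokens(text):
    """Lowercased whitespace tokens as a set (a stand-in for real preprocessing)."""
    return set(text.lower().split())


def coverage(target, group):
    """Fraction of the target's tokens found in at least one sentence of the group."""
    target_tokens = tokens(target)
    if not target_tokens:
        return 0.0
    covered = set().union(*(tokens(s) for s in group))
    return len(target_tokens & covered) / len(target_tokens)


def overlap(group):
    """Mean pairwise Jaccard overlap between sentences; 0.0 for a single sentence."""
    pairs = list(combinations(group, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        ta, tb = tokens(a), tokens(b)
        union = ta | tb
        total += len(ta & tb) / len(union) if union else 0.0
    return total / len(pairs)


def score(group, relevance, question, answer, eps=1.0):
    """Assumed combination: reward relevance and coverage, penalize overlap."""
    mean_relevance = sum(relevance[s] for s in group) / len(group)
    return (
        mean_relevance
        / (eps + overlap(group))
        * coverage(question, group)
        * coverage(answer, group)
    )


def best_justification_set(candidates, relevance, question, answer, k=2):
    """Score every k-sentence subset of the candidates and return the best one."""
    return max(
        combinations(candidates, k),
        key=lambda group: score(group, relevance, question, answer),
    )


# Toy data; the relevance values are hypothetical retriever scores.
question = "what do plants need to make food"
answer = "sunlight"
s1 = "plants make food through photosynthesis"
s2 = "plants make food in leaves"
s3 = "photosynthesis requires sunlight"
relevance = {s1: 2.0, s2: 2.0, s3: 1.0}

best = best_justification_set([s1, s2, s3], relevance, question, answer, k=2)
```

In this toy case the pair of the two most relevant sentences scores zero because it never mentions the answer, and of the two remaining pairs the one whose sentences share no words is preferred. The selected set would then be passed, together with the question and candidate answer, to the supervised QA model.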
