Quick and (not so) Dirty: Unsupervised Selection of Justification Sentences for Multi-hop Question Answering

Vikas Yadav, Steven Bethard, Mihai Surdeanu

TL;DR

The paper tackles the challenge of selecting justification sentences for multi-hop QA without relying on labeled justification data. It introduces ROCC, an unsupervised method that builds candidate justification sets from BM25-retrieved sentences and ranks them using a joint objective that favors relevance, minimizes overlap, and maximizes coverage of the question and answer; the best set is then used to condition a BERT-based QA classifier. Empirically, ROCC achieves state-of-the-art results on ARC and MultiRC among approaches that do not use external training data, with AutoROCC offering strong domain robustness and higher justification quality. The work demonstrates both improved interpretability and practical viability, and suggests ROCC as a potential distant supervision signal for supervised justification strategies.

Abstract

We propose an unsupervised strategy for the selection of justification sentences for multi-hop question answering (QA) that (a) maximizes the relevance of the selected sentences, (b) minimizes the overlap between the selected facts, and (c) maximizes the coverage of both question and answer. This unsupervised sentence selection method can be coupled with any supervised QA approach. We show that the sentences selected by our method improve the performance of a state-of-the-art supervised QA model on two multi-hop QA datasets: AI2's Reasoning Challenge (ARC) and Multi-Sentence Reading Comprehension (MultiRC). We obtain new state-of-the-art performance on both datasets among approaches that do not use external resources for training the QA system: 56.82% F1 on ARC (41.24% on Challenge and 64.49% on Easy) and 26.1% EM0 on MultiRC. Our justification sentences have higher quality than the justifications selected by a strong information retrieval baseline, e.g., by 5.4% F1 in MultiRC. We also show that our unsupervised selection of justification sentences is more stable across domains than a state-of-the-art supervised sentence selection method.
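
The three criteria in the abstract (relevance, overlap, coverage) can be illustrated with a small scoring sketch. This is not the paper's implementation: the abstract does not give the formula, so the way the terms are combined here, the `eps` smoothing constant, the term-count stand-in for a BM25 relevance score, the Jaccard overlap, and all function names are assumptions made purely for illustration.

```python
from itertools import combinations


def tokens(text):
    """Lowercased word set; a crude stand-in for real tokenization."""
    return {w.strip(".,;:?!()").lower() for w in text.split()} - {""}


def relevance(sentences, query_terms):
    # Assumption: query-term matches per sentence, summed over the set.
    # A real system would use a retrieval score such as BM25 here.
    return sum(len(tokens(s) & query_terms) for s in sentences)


def overlap(sentences):
    # Average pairwise Jaccard overlap; 0.0 when the set has one sentence.
    pairs = list(combinations(sentences, 2))
    if not pairs:
        return 0.0
    total = 0.0
    for a, b in pairs:
        ta, tb = tokens(a), tokens(b)
        total += len(ta & tb) / len(ta | tb) if ta | tb else 0.0
    return total / len(pairs)


def coverage(sentences, terms):
    # Fraction of the given terms that appear anywhere in the set.
    if not terms:
        return 0.0
    covered = set().union(*(tokens(s) for s in sentences))
    return len(covered & terms) / len(terms)


def score(sentences, question, answer, eps=1.0):
    # Hypothetical combination: reward relevance and coverage of both the
    # question and the answer, penalize redundancy between sentences.
    q, a = tokens(question), tokens(answer)
    r = relevance(sentences, q | a)
    return (r / (eps + overlap(sentences))
            * (eps + coverage(sentences, a))
            * (eps + coverage(sentences, q)))


def best_set(candidates, question, answer, k=2, eps=1.0):
    # Exhaustively score every k-sentence subset and keep the best one.
    return max(combinations(candidates, k),
               key=lambda s: score(s, question, answer, eps))
```

With two near-duplicate sentences and one complementary sentence among the candidates, a scorer of this shape prefers a pair that includes the complementary sentence, because the duplicate pair is penalized for overlap and covers fewer question and answer terms. The selected set would then be passed, together with the question and candidate answer, to whatever supervised QA model is in use.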

Paper Structure

This paper contains 15 sections, 1 equation, 2 figures, and 7 tables.

Figures (2)

  • Figure 1: A multiple-choice question from the ARC dataset with the correct answer in bold, followed by justification sentences selected by our approach (ROCC) vs. sentences selected by a strong IR baseline (BM25). ROCC justification sentences fully cover the five key terms in the question (shown in italic), whereas BM25 misses two: esophagus and colon. Further, the second BM25 sentence is largely redundant with the first, not covering other query terms.
  • Figure 2: An example of the ROCC process for a question from the MultiRC dataset. Here, ROCC correctly extracts the two justification sentences necessary to explain the correct answer.