QED: A Framework and Dataset for Explanations in Question Answering
Matthew Lamm, Jennimaria Palomaki, Chris Alberti, Daniel Andor, Eunsol Choi, Livio Baldini Soares, Michael Collins
TL;DR
QED addresses the need for explanations in QA by proposing a linguistically grounded framework that bases explanations on referential equality and predicate entailment. It introduces a publicly released, expert-annotated dataset built on a subset of Natural Questions and defines four modeling tasks, with baselines reported for two of them: post-hoc explanation generation given an answer, and joint question answering and explanation generation. In the joint setting, results suggest that training on even a relatively small amount of QED data can improve QA performance. A rater study shows that QED explanations significantly improve the ability of untrained raters to spot errors made by a strong neural QA baseline, which supports the practical utility of such explanations. The work lays a path for extending QA with structured semantic explanations and invites future research on faithfulness and richer referential phenomena.
Abstract
A question answering system that, in addition to providing an answer, provides an explanation of the reasoning that leads to that answer has potential advantages in terms of debuggability, extensibility, and trust. To this end, we propose QED, a linguistically informed, extensible framework for explanations in question answering. A QED explanation specifies the relationship between a question and answer according to formal semantic notions such as referential equality, sentencehood, and entailment. We describe and publicly release an expert-annotated dataset of QED explanations built upon a subset of the Google Natural Questions dataset, and report baseline models on two tasks: post-hoc explanation generation given an answer, and joint question answering and explanation generation. In the joint setting, a promising result suggests that training on a relatively small amount of QED data can improve question answering. In addition to describing the formal, language-theoretic motivations for the QED approach, we describe a large user study showing that the presence of QED explanations significantly improves the ability of untrained raters to spot errors made by a strong neural QA baseline.
