e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations
Virginie Do, Oana-Maria Camburu, Zeynep Akata, Thomas Lukasiewicz
TL;DR
SNLI-VE was assembled automatically, which introduced a substantial number of label errors, with the neutral class having the highest error rate. The authors correct the neutral labels to create SNLI-VE-2.0, re-evaluate a BUTD baseline for visual-textual entailment (VTE) on the corrected corpus, and introduce e-SNLI-VE by appending human-written natural language explanations to SNLI-VE-2.0. They further explore two explanation-aware VTE models, PredictAndExplain and ExplainThenPredict, which learn from explanations during training and generate them at test time. The work improves data quality, suggests that learning from explanations is feasible without sacrificing much accuracy, and analyses explanation relevance both qualitatively and quantitatively, while highlighting how hard it remains to produce consistently relevant explanations.
Abstract
The recently proposed SNLI-VE corpus for recognising visual-textual entailment is a large, real-world dataset for fine-grained multimodal reasoning. However, the automatic way in which SNLI-VE has been assembled (via combining parts of two related datasets) gives rise to a large number of errors in the labels of this corpus. In this paper, we first present a data collection effort to correct the class with the highest error rate in SNLI-VE. Secondly, we re-evaluate an existing model on the corrected corpus, which we call SNLI-VE-2.0, and provide a quantitative comparison with its performance on the non-corrected corpus. Thirdly, we introduce e-SNLI-VE, which appends human-written natural language explanations to SNLI-VE-2.0. Finally, we train models that learn from these explanations at training time, and output such explanations at testing time.
