e-SNLI-VE: Corrected Visual-Textual Entailment with Natural Language Explanations

Virginie Do, Oana-Maria Camburu, Zeynep Akata, Thomas Lukasiewicz

TL;DR

SNLI-VE contains a substantial number of label errors, concentrated in the neutral class, because the dataset was assembled automatically. The authors correct the neutral labels to create SNLI-VE-2.0, re-evaluate a baseline visual-textual entailment (VTE) model, BUTD, on the corrected corpus, and build e-SNLI-VE by appending human-written explanations. They further explore two explanation-aware VTE models, PredictAndExplain and ExplainThenPredict, which learn from explanations during training and generate them at test time. The work improves data quality, shows that learning from explanations is feasible without sacrificing much accuracy, and provides a qualitative and quantitative analysis of how relevant the generated explanations are and where they fall short. Overall, the study advances robust and explainable multimodal reasoning by integrating crowdsourced explanations into VTE data and models, while highlighting how hard it is to produce consistently relevant explanations.

Abstract

The recently proposed SNLI-VE corpus for recognising visual-textual entailment is a large, real-world dataset for fine-grained multimodal reasoning. However, the automatic way in which SNLI-VE has been assembled (via combining parts of two related datasets) gives rise to a large number of errors in the labels of this corpus. In this paper, we first present a data collection effort to correct the class with the highest error rate in SNLI-VE. Secondly, we re-evaluate an existing model on the corrected corpus, which we call SNLI-VE-2.0, and provide a quantitative comparison with its performance on the non-corrected corpus. Thirdly, we introduce e-SNLI-VE, which appends human-written natural language explanations to SNLI-VE-2.0. Finally, we train models that learn from these explanations at training time, and output such explanations at testing time.

Paper Structure

This paper contains 26 sections, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Examples from SNLI-VE-2.0. (a) In red, the neutral label from SNLI-VE is wrong, since the picture clearly shows that the crowd is outdoors. We corrected it to entailment in SNLI-VE-2.0. (b) In green, an ambiguous instance. There is indeed an American flag in the background but it is very hard to see, hence the ambiguity between neutral and entailment, and even contradiction if one cannot spot it. Further, it is not clear whether "they" implies the whole group or the people visible in the image.
  • Figure 2: MTurk annotation screen. (a) The label contradiction is chosen, (b) the evidence words "man", "violin", and "crowd" are highlighted, and (c) an explanation is written with these words.
  • Figure 3: Two image-sentence pairs from e-SNLI-VE with (a) at the top, an uninformative explanation from e-SNLI, (b) at the bottom, an explanation collected from our crowdsourcing. We only collected new explanations for the neutral class (along with new labels). The SNLI premise is not included in e-SNLI-VE.
  • Figure 4: PaE-BUTD-VE. The generation of explanation is conditioned on the image premise, textual hypothesis, and predicted label.
  • Figure 5: Architecture of EtP-BUTD-VE. First, an explanation is generated; second, the label is predicted from the explanation. The two models (in separate dashed rectangles) are not trained jointly.
  • ...and 6 more figures
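
The captions of Figures 4 and 5 describe two different orderings of label prediction and explanation generation. The following is a minimal Python sketch of that difference in data flow only, not the authors' implementation. All names here (`Example`, `predict_and_explain`, `explain_then_predict`, and the callables passed in) are hypothetical, and the models themselves are replaced by plain functions supplied by the caller.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple


@dataclass
class Example:
    """Hypothetical container for one image-hypothesis pair."""
    image_features: Sequence[float]  # stand-in for the image premise
    hypothesis: str                  # textual hypothesis


def predict_and_explain(
    ex: Example,
    classify: Callable[[Sequence[float], str], str],
    explain: Callable[[Sequence[float], str, str], str],
) -> Tuple[str, str]:
    """PaE-style flow (Figure 4): predict the label first, then generate
    an explanation conditioned on image, hypothesis, and predicted label."""
    label = classify(ex.image_features, ex.hypothesis)
    explanation = explain(ex.image_features, ex.hypothesis, label)
    return label, explanation


def explain_then_predict(
    ex: Example,
    explain: Callable[[Sequence[float], str], str],
    label_from_explanation: Callable[[str], str],
) -> Tuple[str, str]:
    """EtP-style flow (Figure 5): generate an explanation first, then
    predict the label from the explanation with a separate model."""
    explanation = explain(ex.image_features, ex.hypothesis)
    label = label_from_explanation(explanation)
    return label, explanation


if __name__ == "__main__":
    # Toy stand-ins for trained models; assumptions for illustration only.
    ex = Example(image_features=[0.1, 0.2], hypothesis="The crowd is outdoors.")
    print(predict_and_explain(
        ex,
        classify=lambda feats, hyp: "entailment",
        explain=lambda feats, hyp, label: f"({label}) toy explanation",
    ))
    print(explain_then_predict(
        ex,
        explain=lambda feats, hyp: "toy explanation",
        label_from_explanation=lambda text: "entailment",
    ))
```

In this sketch, the EtP label predictor never sees the image or the hypothesis directly, which mirrors the caption's statement that the label is predicted from the explanation and that the two models are not trained jointly.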