Visual Entailment Task for Visually-Grounded Language Learning

Ning Xie, Farley Lai, Derek Doran, Asim Kadav

TL;DR

The paper introduces Visual Entailment (VE), a cross-modal inference task where an image serves as the premise for textual hypotheses, and provides the SNLI-VE dataset by aligning SNLI premises with Flickr30k images. It proposes EVE, a dual-branch architecture that applies self-attention and text-image grounding to determine entailment, neutral, or contradiction. Compared with VQA-based baselines, EVE-Image achieves the strongest performance (~71% accuracy) on SNLI-VE, demonstrating the value of grounded, explainable multimodal reasoning. The work releases SNLI-VE publicly and discusses dataset design and biases relevant to cross-modal inference.

Abstract

We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) tasks in that the premise is defined by an image rather than a natural language sentence. A novel dataset, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), is proposed for VE tasks based on the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights on how modern VQA-based models perform.
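
As a rough illustration of the task format described above, a VE instance can be thought of as an image premise, a text hypothesis, and one of three labels. The sketch below is not taken from the paper or from the SNLI-VE repository: the names `Label`, `VEExample`, `accuracy`, and `always_neutral`, the image identifier, and the toy hypotheses are all hypothetical and chosen only for illustration.

```python
from dataclasses import dataclass
from enum import Enum
from typing import Callable, Iterable


class Label(Enum):
    """The three VE classes named in the text."""
    ENTAILMENT = "entailment"
    NEUTRAL = "neutral"
    CONTRADICTION = "contradiction"


@dataclass(frozen=True)
class VEExample:
    """One VE instance (field names are assumptions, not the dataset schema)."""
    image_id: str    # hypothetical identifier of the premise image
    hypothesis: str  # text hypothesis judged against the image
    label: Label     # gold label


def accuracy(examples: Iterable[VEExample],
             predict: Callable[[str, str], Label]) -> float:
    """Fraction of examples whose predicted label matches the gold label."""
    items = list(examples)
    if not items:
        return 0.0
    correct = sum(1 for ex in items
                  if predict(ex.image_id, ex.hypothesis) == ex.label)
    return correct / len(items)


# Toy data: one image paired with three hypotheses, one per label.
examples = [
    VEExample("img_001.jpg", "Two people are outdoors.", Label.ENTAILMENT),
    VEExample("img_001.jpg", "The people are siblings.", Label.NEUTRAL),
    VEExample("img_001.jpg", "Nobody is in the picture.", Label.CONTRADICTION),
]


def always_neutral(image_id: str, hypothesis: str) -> Label:
    """Trivial stand-in predictor that ignores both inputs."""
    return Label.NEUTRAL


score = accuracy(examples, always_neutral)
```

A real model such as EVE would take the place of `always_neutral`, mapping the image and the hypothesis to one of the three labels; the accuracy figure quoted in the TL;DR is this kind of label-match rate computed on SNLI-VE.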

Paper Structure

This paper contains 7 sections, 2 figures, 2 tables.

Figures (2)

  • Figure 1: A VE example showing that pairing the same image with different hypotheses leads to different labels.
  • Figure 2: EVE architecture. EVE determines whether a hypothesis (text input) is entailed by an image premise (image input). The bottom half shows two methods of image feature extraction: from CNN feature maps or from object-detection ROIs.