Visual Entailment Task for Visually-Grounded Language Learning

Ning Xie; Farley Lai; Derek Doran; Asim Kadav

TL;DR

The paper introduces Visual Entailment (VE), a cross-modal inference task in which an image serves as the premise for a textual hypothesis, and provides the SNLI-VE dataset by aligning SNLI premises with Flickr30k images. It proposes EVE, a dual-branch architecture that applies self-attention and text-image grounding to classify each image-hypothesis pair as entailment, neutral, or contradiction. Among the models compared, which include VQA-based baselines, EVE-Image achieves the strongest performance on SNLI-VE (~71% accuracy), a result that supports grounded, explainable multimodal reasoning. The work releases SNLI-VE publicly and discusses dataset design and biases relevant to cross-modal inference.

Abstract

We introduce a new inference task, Visual Entailment (VE), which differs from traditional Textual Entailment (TE) tasks in that the premise is an image rather than a natural language sentence. A novel dataset, SNLI-VE (publicly available at https://github.com/necla-ml/SNLI-VE), is proposed for VE tasks based on the Stanford Natural Language Inference corpus and Flickr30k. We introduce a differentiable architecture called the Explainable Visual Entailment model (EVE) to tackle the VE problem. EVE and several other state-of-the-art visual question answering (VQA) based models are evaluated on the SNLI-VE dataset, facilitating grounded language understanding and providing insights on how modern VQA-based models perform.
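
To make the task format concrete, here is a minimal Python sketch of how a VE example could be derived from an SNLI-style record. It is an illustration under stated assumptions, not the authors' build script: the field names (captionID, sentence2, gold_label) follow the SNLI JSONL release as commonly distributed, the VEExample type and snli_to_ve helper are hypothetical names, and the sample record is made up.

```python
from dataclasses import dataclass
from typing import Optional

# The three VE classes named in the summary above.
LABELS = ("entailment", "neutral", "contradiction")


@dataclass(frozen=True)
class VEExample:
    """Hypothetical container for one VE instance (not from the paper)."""
    image_id: str    # Flickr30k image file name; the image is the premise
    hypothesis: str  # natural-language hypothesis
    label: str       # one of LABELS


def snli_to_ve(record: dict) -> Optional[VEExample]:
    """Swap the textual premise of an SNLI-style record for its source image.

    Assumes captionID has the form "<image file>#<caption index>".
    """
    label = record["gold_label"]
    if label not in LABELS:
        # Assumption: pairs without a consensus label are skipped.
        return None
    image_id = record["captionID"].split("#", 1)[0]
    return VEExample(image_id, record["sentence2"], label)


# Made-up record for illustration only.
record = {
    "captionID": "1234567890.jpg#0",
    "sentence1": "Two people walk along a beach.",
    "sentence2": "Two people are outdoors.",
    "gold_label": "entailment",
}
example = snli_to_ve(record)
```

The textual premise (sentence1) is deliberately dropped: a VE model sees only the image and the hypothesis. The released SNLI-VE dataset may apply further filtering and its own split assignment, which this sketch does not attempt to reproduce.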
