Defeasible Visual Entailment: Benchmark, Evaluator, and Reward-Driven Optimization

Yue Zhang, Liqiang Jing, Vibhav Gogate

TL;DR

This work defines Defeasible Visual Entailment (DVE), a task that allows textual updates to modify the entailment relationship between an image $I$ and a hypothesis $H$. It builds a first DVE benchmark by pairing Flickr30k premises with SNLI hypotheses and $\delta$-NLI updates, introducing classification and generation sub-tasks. A novel inference-aware evaluator, trained with pairwise contrastive learning and categorical information loss, outputs an entailment-strength score $s$ reflecting how updates shift $H$'s truth likelihood given $I$, and is validated against human judgments. To further improve update quality, a reward-driven update optimization loop uses evaluator feedback to refine generated updates beyond baseline multimodal models. Experimental results on classification and generation demonstrate strong GPT-4o performance and show that the evaluator correlates better with human judgments than traditional metrics, with the optimization framework yielding higher-quality, more effective updates for defeasible multimodal reasoning.

Abstract

We introduce a new task called Defeasible Visual Entailment (DVE), where the goal is to allow the modification of the entailment relationship between an image premise and a text hypothesis based on an additional update. While this concept is well-established in Natural Language Inference, it remains unexplored in visual entailment. At a high level, DVE enables models to refine their initial interpretations, leading to improved accuracy and reliability in various applications such as detecting misleading information in images, enhancing visual question answering, and refining decision-making processes in autonomous systems. Existing metrics do not adequately capture the change in the entailment relationship brought by updates. To address this, we propose a novel inference-aware evaluator designed to capture changes in entailment strength induced by updates, using pairwise contrastive learning and categorical information learning. Additionally, we introduce a reward-driven update optimization method to further enhance the quality of updates generated by multimodal models. Experimental results demonstrate the effectiveness of our proposed evaluator and optimization method.
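The pairwise contrastive idea behind the evaluator can be illustrated with a small sketch. The abstract does not give the loss, so the hinge-style margin form below is an assumption, as are the function names, the `margin` parameter, and the "strengthener"/"weakener" labels for updates that raise or lower the entailment strength. It only shows the general principle: for the same image and hypothesis, an update that strengthens the hypothesis should receive a higher score than one that weakens it.

```python
def pairwise_margin_loss(score_strengthener, score_weakener, margin=1.0):
    """Hinge-style pairwise loss (assumed form, not taken from the paper).

    The loss is zero once the strengthening update outscores the
    weakening update by at least `margin`; otherwise it grows linearly.
    """
    return max(0.0, margin - (score_strengthener - score_weakener))


def batch_loss(pairs, margin=1.0):
    """Average the pairwise loss over (strengthener, weakener) score pairs."""
    return sum(pairwise_margin_loss(s, w, margin) for s, w in pairs) / len(pairs)


if __name__ == "__main__":
    # Hypothetical evaluator scores for two image-hypothesis pairs.
    scores = [(2.0, 0.5), (0.0, 1.0)]
    print(batch_loss(scores))
```

In this toy batch the first pair is already separated by more than the margin and contributes no loss, while the second pair is ranked the wrong way round and is penalized.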

Paper Structure

This paper contains 43 sections, 10 equations, 9 figures, 7 tables.

Figures (9)

  • Figure 1: An example of defeasibility in visual entailment.
  • Figure 2: The workflow of generating the DVE dataset by integrating premises and hypotheses from SNLI with images from Flickr30k and updates from $\delta$-NLI.
  • Figure 3: The architecture of our Inference-aware Evaluator, including three modules: Multimodal Embedding, Feature Fusion, and Multitask Learning. HC/HU Embedding means the embedding of the hypothesis-caption/hypothesis-update pair. Similarly, HC/HU Multimodal Representation stands for the multimodal representation of the hypothesis-caption/hypothesis-update pair.
  • Figure 4: An overview of Reward-driven Update Optimization, which includes three steps: Initial Response Generation, Critique, and Refinement.
  • Figure 5: Prompt used for the Classification Task across LVLMs.
  • ...and 4 more figures
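
The three steps named in the Figure 4 caption (Initial Response Generation, Critique, Refinement) suggest a simple control loop. The sketch below is a hypothetical illustration under stated assumptions, not the authors' implementation: the callables `generate`, `critique`, `refine`, and `score`, the `max_rounds` limit, and the stop-when-no-improvement rule are all invented for illustration, and `score` stands in for an evaluator reward where higher is better.

```python
def optimize_update(generate, critique, refine, score, max_rounds=3):
    """Generate an update, then repeatedly critique and refine it.

    All four callables are hypothetical stand-ins:
      generate()                -> initial update text
      critique(update, reward)  -> feedback text
      refine(update, feedback)  -> revised update text
      score(update)             -> reward, higher is better
    """
    best = generate()                           # Step 1: initial response
    best_score = score(best)
    for _ in range(max_rounds):
        feedback = critique(best, best_score)   # Step 2: critique
        candidate = refine(best, feedback)      # Step 3: refinement
        candidate_score = score(candidate)
        if candidate_score <= best_score:       # stop when the reward stalls
            break
        best, best_score = candidate, candidate_score
    return best, best_score


if __name__ == "__main__":
    # Toy stand-ins: the "reward" is just the text length.
    result = optimize_update(
        generate=lambda: "a",
        critique=lambda update, reward: "make it longer",
        refine=lambda update, feedback: update + "!",
        score=len,
    )
    print(result)
```

Keeping the best-scoring update rather than the latest one means a refinement that lowers the reward is discarded instead of returned.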