Selective Vision is the Challenge for Visual Reasoning: A Benchmark for Visual Argument Understanding

Jiwan Chung, Sungjae Lee, Minseo Kim, Seungju Han, Ashkan Yousefpour, Jack Hessel, Youngjae Yu

TL;DR

This work presents VisArgs, a dataset of 1,611 images annotated with 5,112 visual premises, 5,574 commonsense premises, and reasoning trees connecting them into structured arguments, and proposes three tasks for evaluating visual argument understanding: premise localization, premise identification, and conclusion deduction.

Abstract

Visual arguments, often used in advertising or social causes, rely on images to persuade viewers to do or believe something. Understanding these arguments requires selective vision: only specific visual stimuli within an image are relevant to the argument, and relevance can only be understood within the context of a broader argumentative structure. While visual arguments are readily appreciated by human audiences, we ask: are today's AI systems capable of similar understanding? We present VisArgs, a dataset of 1,611 images annotated with 5,112 visual premises (with regions), 5,574 commonsense premises, and reasoning trees connecting them into structured arguments. We propose three tasks for evaluating visual argument understanding: premise localization, premise identification, and conclusion deduction. Experiments show that (1) machines struggle to capture visual cues: GPT-4o achieved 78.5% accuracy, while humans reached 98.0%, and models performed 19.5% worse when distinguishing between irrelevant objects within the image than when distinguishing external objects; and (2) providing relevant visual premises improved model performance significantly.
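
The annotation structure described in the abstract (visual premises with regions, commonsense premises, and a reasoning tree connecting them to a conclusion) can be pictured with a small sketch. This is only an illustration under stated assumptions: the class names, field names, box format, and the toy example are hypothetical and are not taken from the VisArgs release or its actual schema.

```python
from dataclasses import dataclass, field
from typing import List, Tuple, Union


@dataclass
class VisualPremise:
    """A premise grounded in an image region (field names are hypothetical)."""
    text: str
    # Assumed box format: (x_min, y_min, x_max, y_max) in pixels.
    box: Tuple[float, float, float, float]


@dataclass
class CommonsensePremise:
    """A premise that comes from background knowledge, not from the image."""
    text: str


@dataclass
class Conclusion:
    """An intermediate or final conclusion supported by premises or sub-conclusions."""
    text: str
    supports: List[Union[VisualPremise, CommonsensePremise, "Conclusion"]] = field(
        default_factory=list
    )


def collect_visual_premises(node: Conclusion) -> List[VisualPremise]:
    """Walk the tree depth-first and gather every visual premise below `node`."""
    found: List[VisualPremise] = []
    for child in node.supports:
        if isinstance(child, VisualPremise):
            found.append(child)
        elif isinstance(child, Conclusion):
            found.extend(collect_visual_premises(child))
    return found


# Toy example with invented content (not an item from the dataset).
intermediate = Conclusion(
    text="The object in the image is disappearing.",
    supports=[
        VisualPremise("A partly melted ice cube", (10.0, 20.0, 110.0, 140.0)),
        CommonsensePremise("Ice melts when it gets warm."),
    ],
)
root = Conclusion(
    text="Viewers should act before it is too late.",
    supports=[
        intermediate,
        VisualPremise("A clock drawn next to the ice cube", (150.0, 30.0, 220.0, 100.0)),
    ],
)

visual = collect_visual_premises(root)
```

In a representation like this, premise localization would correspond to predicting each `box`, premise identification to picking the right `VisualPremise` for a given intermediate conclusion, and conclusion deduction to producing the root `text`.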

Paper Structure

This paper contains 33 sections, 12 figures, and 16 tables.

Figures (12)

  • Figure 1: An example from our VisArgs corpus. VisArgs makes the persuasion process in a visual argument explicit by representing it as a reasoning tree. Image credit: Eglė Plytnikaitė
  • Figure 2: To identify the bottleneck in visual argument understanding, we define three tasks over VisArgs. Localization of Premises requires models to ground the visual premises. Identification of Premises requires models to infer the visual premise relevant to a given intermediate conclusion. Deduction of Conclusion studies the ability of models to deduce the argument's conclusion from different levels of input.
  • Figure 3: Human workers iteratively refine initial machine-generated data in the VisArgs annotation process.
  • Figure 4: Variety of the topics represented in the visual premises and conclusions in VisArgs.
  • Figure 5: Failure cases of LLaVA-1.5 in Identification of Premises. The model incorrectly reasons about relevant objects, relying instead on common words.
  • ...and 7 more figures