YesBut: A High-Quality Annotated Multimodal Dataset for evaluating Satire Comprehension capability of Vision-Language Models

Abhilash Nandy; Yash Agarwal; Ashish Patwa; Millon Madhur Das; Aman Bansal; Ankit Raj; Pawan Goyal; Niloy Ganguly

TL;DR

This paper proposes the challenging tasks of Satirical Image Detection, Understanding, and Completion, and releases a high-quality dataset, ***YesBut***, to evaluate them. The dataset consists of 2547 images (1084 satirical and 1463 non-satirical) in different artistic styles.

Abstract

Understanding satire and humor is challenging even for current Vision-Language models. In this paper, we propose the challenging tasks of Satirical Image Detection (detecting whether an image is satirical), Understanding (generating the reason behind the image being satirical), and Completion (given one half of the image, selecting the other half from 2 given options, such that the complete image is satirical), and release a high-quality dataset, YesBut, to evaluate those tasks. The dataset consists of 2547 images (1084 satirical and 1463 non-satirical) in different artistic styles. Each satirical image in the dataset depicts a normal scenario, along with a conflicting scenario that is funny or ironic. Despite the success of current Vision-Language Models on multimodal tasks such as Visual QA and Image Captioning, our benchmarking experiments show that such models perform poorly on the proposed tasks on the YesBut Dataset in zero-shot settings, with respect to both automated and human evaluation. Additionally, we release a dataset of 119 real, satirical photographs for further research. The dataset and code are available at https://github.com/abhi1nandy2/yesbut_dataset.
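
To make the Detection and Completion task definitions concrete, the following is a minimal Python sketch of how accuracy on the two tasks could be scored, together with the trivial baselines implied by the dataset counts above. It is not the authors' evaluation code, and this section does not state which metrics the paper reports. The `CompletionItem` record, its field names, and all function names are assumptions introduced for illustration only.

```python
from dataclasses import dataclass
from typing import Callable, Sequence, Tuple

# Image counts as stated in the abstract.
N_SATIRICAL = 1084
N_NON_SATIRICAL = 1463
N_TOTAL = N_SATIRICAL + N_NON_SATIRICAL  # 2547


@dataclass(frozen=True)
class CompletionItem:
    """Hypothetical record for the Completion task (field names are assumptions)."""
    given_half: str            # identifier of the half shown to the model
    options: Tuple[str, str]   # the two candidate halves
    answer: int                # index (0 or 1) of the half that makes the image satirical


def accuracy(predictions: Sequence[int], labels: Sequence[int]) -> float:
    """Fraction of predictions that match the labels."""
    if len(predictions) != len(labels):
        raise ValueError("predictions and labels must have the same length")
    if len(labels) == 0:
        raise ValueError("cannot compute accuracy on an empty set")
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def detection_majority_baseline() -> float:
    """Detection accuracy of always predicting the larger class."""
    return max(N_SATIRICAL, N_NON_SATIRICAL) / N_TOTAL


def completion_accuracy(
    items: Sequence[CompletionItem],
    choose: Callable[[CompletionItem], int],
) -> float:
    """Score a chooser (e.g. a wrapper around a model) on two-option completion."""
    predictions = [choose(item) for item in items]
    return accuracy(predictions, [item.answer for item in items])
```

Under these counts, always answering "non-satirical" would already score 1463/2547 (about 57%) on Detection, and guessing uniformly at random between the 2 options gives an expected 50% on Completion, so reported accuracies are best read against those reference points.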

Paper Structure (33 sections, 14 figures, 12 tables)

This paper contains 33 sections, 14 figures, 12 tables.

Introduction
Background
Satirical and Humor Datasets
Other Image Datasets
Our Annotation Pipeline
Stage 1: Collecting Satirical Images from Social Media
Stage 2: Annotation of satirical images
Stage 3: Generating 2D stick images using DALL-E 3 on the annotated descriptions
Stage 4: Generating 3D stick images using DALL-E 3 on the annotated descriptions
The YesBut Dataset
Experimental Setup
Models
Tasks
Evaluation Setup
Results
...and 18 more sections

Figures (14)

Figure 1: Satire conveyed through a social media image
Figure 2: Our annotation Pipeline for YesBut in 4 Stages - (1) Collecting Satirical Images from Social Media (2) Human Annotation of satirical images (3) Generating 2D stick images using DALL-E 3 and annotated descriptions (4) Generating 3D stick images using DALL-E 3 and annotated descriptions
Figure 3: Distribution of the original 283 satirical images downloaded from Social Media based on different aspects of image content and annotated descriptions
Figure 4: 2D UMAP Representations of CLIP Image representations of YesBut sub-images
Figure 5: Evaluation of Satirical Image Understanding Capability using multiple VL models at different stages (Stages 2, 3, 4) of annotation of YesBut, as well as, for all YesBut images
...and 9 more figures
