HumorDB: Can AI understand graphical humor?

Vedaant Jain; Felipe dos Santos Alves Feitosa; Gabriel Kreiman

TL;DR

HumorDB tackles graphical humor understanding, a challenging visual reasoning task requiring contextual knowledge and incongruity detection. It introduces a large, diverse, controlled image dataset with minimally contrastive funny/not-funny pairs and a three-task human evaluation to benchmark AI performance. The study benchmarks a wide range of vision-only and vision-language models, analyzes model explanations and attention patterns via mechanistic interpretability methods, and reveals a persistent gap between AI and human humor understanding, especially on abstract sketches and subtle cues. The dataset and accompanying analyses highlight the need for architectures capable of bridging visual perception and abstract reasoning, and provide a resource and methodology for advancing visual humor understanding.

Abstract

Despite significant advances in image segmentation and object detection, understanding complex scenes remains a major challenge. Here, we focus on graphical humor as a paradigmatic example of image interpretation that requires elucidating the interaction of different scene elements in the context of prior cognitive knowledge. This paper introduces HumorDB, a novel, controlled, and carefully curated dataset designed to evaluate and advance visual humor understanding by AI systems. The dataset comprises diverse images spanning photos, cartoons, sketches, and AI-generated content, including minimally contrastive pairs in which subtle edits differentiate humorous from non-humorous versions. We evaluate humans, state-of-the-art vision models, and large vision-language models on three tasks: binary humor classification, funniness rating prediction, and pairwise humor comparison. The results reveal a gap between current AI systems and human-level humor understanding. While pretrained vision-language models perform better than vision-only models, they still struggle with abstract sketches and subtle humor cues. Analysis of attention maps shows that even when models correctly classify humorous images, they often fail to focus on the precise regions that make the image funny. Preliminary mechanistic interpretability studies and evaluation of model explanations provide initial insights into how different architectures process humor. Our results identify promising trends and current limitations, suggesting that an effective understanding of visual humor requires sophisticated architectures capable of detecting subtle contextual features and bridging the gap between visual perception and abstract reasoning. All the code and data are available here: https://github.com/kreimanlab/HumorDB

Paper Structure

This paper contains 26 sections, 14 figures, 6 tables.

Introduction
Related Work
Methods
Building HumorDB
Assessing human performance
Experiments
Results
Acknowledgements
Appendix
Training details
External assets used
Attention maps
Crowdsourcing details
Multimodal models' answer explanations
Participants' Demographics
...and 11 more sections

Figures (14)

Figure 1: Example image pair. Left: image rated as funny (83.3% of participants). Right: modified image rated as not funny (85.7% of participants). Focus on the phone in the surgeon's hand in the left image.
Figure 2: The data showed between-subject consistency. Each cell (i, j) represents the percentage of times when image i was rated funnier than image j. Participants tended to agree on which image was funnier; images 6 and 3 were rated funnier than the others most of the time.
Figure 3: Modifications rendered images less humorous. Each point compares the rating of image pairs (y-axis: original, x-axis: modified pair; total 1,271 pairs; line = identity). For the majority of images ($86.4 \%$), the ratings for the original images were higher.
Figure 4: Participants showed self-consistency in the Comparison task. The x-axis shows the representative images, while the y-axis shows the percentage of instances where a participant's second rating matched their first for comparisons containing that image. A 100% match is perfect self-consistency.
Figure 5: Participants showed high self-reliability (Range Ratings). Higher ratings denote more humorous images. For each participant, 10 images were presented twice at random time points to assess reliability. There was a strong correlation ($\rho$ = 0.89) between the first and second ratings (1,800 pairs; circle sizes indicate number of ratings/pair). The dashed line shows the diagonal.
...and 9 more figures
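
The pairwise matrix described in the Figure 2 caption can be illustrated with a minimal sketch. It assumes each comparison is recorded as a (funnier image, less funny image) index pair. The function name `preference_matrix` and the toy data are hypothetical and are not taken from the HumorDB code.

```python
def preference_matrix(comparisons, n_images):
    """Build a matrix whose cell (i, j) is the percentage of i-vs-j
    comparisons in which image i was rated funnier than image j.

    comparisons: iterable of (winner, loser) image-index pairs (assumed format).
    Cells for pairs that were never compared, and the diagonal, are None.
    """
    # wins[i][j] counts how often image i was judged funnier than image j.
    wins = [[0] * n_images for _ in range(n_images)]
    for winner, loser in comparisons:
        wins[winner][loser] += 1

    matrix = [[None] * n_images for _ in range(n_images)]
    for i in range(n_images):
        for j in range(n_images):
            total = wins[i][j] + wins[j][i]
            if i != j and total > 0:
                matrix[i][j] = 100.0 * wins[i][j] / total
    return matrix


# Hypothetical toy data: image 0 beats image 1 twice and loses once;
# image 2 beats image 1 once; images 0 and 2 are never compared.
toy_comparisons = [(0, 1), (0, 1), (1, 0), (2, 1)]
m = preference_matrix(toy_comparisons, 3)
```

By construction, cells (i, j) and (j, i) sum to 100 whenever the pair was compared at least once, so values far from 50 indicate agreement among raters about which image is funnier.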
