When 'YES' Meets 'BUT': Can Large Models Comprehend Contradictory Humor Through Comparative Reasoning?

Tuo Liang, Zhe Hu, Jing Li, Hao Zhang, Yiren Lu, Yunlai Zhou, Yiran Qiao, Disheng Liu, Jeirui Peng, Jing Ma, Yu Yin

TL;DR

The paper addresses the challenge of humor understanding in juxtaposed narratives by introducing YesBut (V2), a large, multilingual comic benchmark with rich narrative annotations across four progressive tasks. It systematically evaluates a broad spectrum of vision-language models and large language models, revealing that even state-of-the-art systems underperform humans on both surface and deep reasoning tasks, with errors concentrated in perception, element identification, and hallucination. Through extensive analyses, the authors demonstrate that improvements can be achieved via text-only data distillation and explicit social knowledge augmentation, providing practical directions for developing context-aware multimodal AI with deeper narrative understanding. These findings highlight critical gaps in current VLMs’ ability to interpret cultural and creative expressions and offer concrete strategies to advance AI toward more robust, socially aware multimodal reasoning in real-world applications.

Abstract

Understanding humor, particularly when it involves complex, contradictory narratives that require comparative reasoning, remains a significant challenge for large vision-language models (VLMs). This limitation hinders AI's ability to engage in human-like reasoning and cultural expression. In this paper, we investigate this challenge through an in-depth analysis of comics that juxtapose panels to create humor through contradictions. We introduce YesBut (V2), a novel benchmark with 1,262 comic images from diverse multilingual and multicultural contexts, featuring comprehensive annotations that capture various aspects of narrative understanding. Using this benchmark, we systematically evaluate a wide range of VLMs through four complementary tasks spanning from surface content comprehension to deep narrative reasoning, with particular emphasis on comparative reasoning between contradictory elements. Our extensive experiments reveal that even the most advanced models significantly underperform compared to humans, with common failures in visual perception, key element identification, comparative analysis, and hallucinations. We further investigate text-based training strategies and social knowledge augmentation methods to enhance model performance. Our findings not only highlight critical weaknesses in VLMs' understanding of cultural and creative expressions but also provide pathways toward developing context-aware models capable of deeper narrative understanding through comparative reasoning.

Paper Structure

This paper contains 54 sections, 17 figures, 8 tables.

Figures (17)

  • Figure 1: We introduce YesBut (V2), a benchmark for assessing AI's ability to interpret juxtaposed comic panels with contradictory narratives. Unlike existing benchmarks, it emphasizes visual understanding, comparative reasoning, and social knowledge. To capture the layered reasoning required for interpreting these contradictions, we design multi-tiered tasks, ranging from basic content recognition to deep narrative comprehension, ensuring a comprehensive assessment of AI’s interpretative abilities.
  • Figure 2: Overview of the Data Construction Pipeline. The dataset construction begins with manually collecting images from social media platforms, verified by human reviewers to ensure authenticity and relevance. Next, a progressive human-AI collaborative annotation stage is employed to enhance labeling accuracy and efficiency. Finally, a rigorous quality control and cross-verification stage is conducted with multiple annotators to refine and validate the dataset.
  • Figure 3: Distribution of the original 1,264 comics downloaded from social media based on different aspects, including embedded text presence, reliance on social knowledge, and distinct humor categories. Overall, we show that our YesBut exhibits balanced text presence, provides insights into social norms and cultural expectations, and captures a diverse thematic range of humor.
  • Figure 4: Human performance on deep reasoning tasks.
  • Figure 5: Human Evaluation of Literal Description and Contradiction Generation Tasks.
  • ...and 12 more figures