Table of Contents
Fetching ...

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

Zhe Hu, Tuo Liang, Jing Li, Yiren Lu, Yunlai Zhou, Yiran Qiao, Jing Ma, Yu Yin

TL;DR

This paper focuses on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction, and introduces the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics.

Abstract

Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.

Cracking the Code of Juxtaposition: Can AI Models Understand the Humorous Contradictions

TL;DR

This paper focuses on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction, and introduces the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics.

Abstract

Recent advancements in large multimodal language models have demonstrated remarkable proficiency across a wide range of tasks. Yet, these models still struggle with understanding the nuances of human humor through juxtaposition, particularly when it involves nonlinear narratives that underpin many jokes and humor cues. This paper investigates this challenge by focusing on comics with contradictory narratives, where each comic consists of two panels that create a humorous contradiction. We introduce the YesBut benchmark, which comprises tasks of varying difficulty aimed at assessing AI's capabilities in recognizing and interpreting these comics, ranging from literal content comprehension to deep narrative reasoning. Through extensive experimentation and analysis of recent commercial or open-sourced large (vision) language models, we assess their capability to comprehend the complex interplay of the narrative humor inherent in these comics. Our results show that even state-of-the-art models still lag behind human performance on this task. Our findings offer insights into the current limitations and potential improvements for AI in understanding human creative expressions.
Paper Structure (27 sections, 16 figures, 4 tables)

This paper contains 27 sections, 16 figures, 4 tables.

Figures (16)

  • Figure 1: We introduce YesBut dataset for comic understanding of juxtaposed comic panels. Given a two-panel comic with a contradictory narrative, we propose several tasks including narrative understanding, underlying philosophy selection and title matching, tackling different levels of comic understanding. (Comic by Anton Gudim).
  • Figure 2: Overview of the data construction pipeline. Pos represents the positive options, and Neg stands for the negative options.
  • Figure 3: Human Evaluation on literal description and contradiction generation.
  • Figure 4: LLMs using different image description as input.
  • Figure 5: VLMs with image only input and image + oracle description as inputs.
  • ...and 11 more figures