Memes-as-Replies: Can Models Select Humorous Manga Panel Responses?

Ryosuke Kohita, Seiichiro Yoshioka

TL;DR

This paper introduces the Meme Reply Selection task and presents MaMe-Re (Manga Meme Reply Benchmark), a benchmark of 100,000 human-annotated pairs of openly licensed Japanese manga panels and social media posts. The results suggest that selecting contextually humorous replies remains an open challenge for current models.

Abstract

Memes are a popular element of modern web communication, used not only as static artifacts but also as interactive replies within conversations. While computational research has focused on analyzing the intrinsic properties of memes, the dynamic and contextual use of memes to create humor remains an understudied area of web science. To address this gap, we introduce the Meme Reply Selection task and present MaMe-Re (Manga Meme Reply Benchmark), a benchmark of 100,000 human-annotated pairs (500,000 total annotations from 2,325 unique annotators) consisting of openly licensed Japanese manga panels and social media posts. Our analysis reveals three key insights: (1) large language models (LLMs) show preliminary evidence of capturing complex social cues such as exaggeration, moving beyond surface-level semantic matching; (2) the inclusion of visual information does not improve performance, revealing a gap between understanding visual content and effectively using it for contextual humor; (3) while LLMs can match human judgments in controlled settings, they struggle to distinguish subtle differences in wit among semantically similar candidates. These findings suggest that selecting contextually humorous replies remains an open challenge for current models.

Paper Structure (38 sections, 5 equations, 7 figures, 2 tables)

Figures (7)

  • Figure 1: Overview of memes-as-replies. (a) Example of meme use on SNS. (b) Visualization of the Meme Reply Selection task. (c) MaMe-Re benchmark with crowdsourced humor labels.
  • Figure 2: Crowdworker annotation interface and full task instruction for the funniness scoring task in MaMe-Re. Top: interface screenshot. Bottom: instruction text shown to annotators.
  • Figure 3: Prompt template. ${FORMAT} is either "id, speech" or "id, speech, description", and ${CANDIDATES} holds the meme candidates in the corresponding CSV format.
  • Figure 4: Main experimental results for Exp1. (a) Table showing the performance ranking across models and methods. S/P: similarity/preference-based; Y/N: with/without descriptions; CHR: Consensus Hit Rate; values in parentheses denote 95% confidence intervals. (b)-(d) Plots of score distributions categorized by panel descriptions, LLMs, and embedding models, respectively.
  • Figure 5: Performance of the retrieve-and-rerank approach (Exp2).
  • ...and 2 more figures