On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation

Wenbo Shang; Yuxi Sun; Jing Ma; Xin Huang

TL;DR

This work addresses the challenge of generating funny captions for cartoons by proposing HOMER, a humor-generation framework grounded in the General Theory of Verbal Humor (GTVH). HOMER uses three coordinated LLM roles—conflicting-script extractor, hierarchical imaginator, and caption generator—augmented with a humor-retrieval module to ground and expand humor through script oppositions and imaginative associations. Key contributions include a modular, interpretable pipeline, a hierarchical imaginator with local/global views and joke-retrieval, and a novel humor-relevance scoring mechanism that balances semantic similarity and conceptual opposition. Empirical results on two New Yorker cartoon datasets show significant improvements over state-of-the-art baselines in automatic metrics ($pass@k$) and human evaluations, with robust performance across different base LLMs and low harmful-content rates. The framework offers a principled, controllable approach to multimodal humor generation with potential for generalization to other humorous domains and modalities.

Abstract

Humor is a common yet intricate form of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs). It is typically framed as funny caption generation for images, which requires visual understanding, humor reasoning, and creative imagination, among other abilities. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, the General Theory of Verbal Humor (GTVH). To produce funny and script-opposite captions, we introduce a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval (HOMER). The framework consists of three LLM-based roles: (1) a conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) a retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) a caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmark datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
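
The three-role flow described in the abstract can be pictured with a minimal sketch. It is not the authors' implementation: the function names, the prompts, the `HumorContext` container, the `retrieve` callback, and the use of a text description in place of the cartoon image are all assumptions made for illustration. The imagination tree is flattened here into a plain list of associations.

```python
from dataclasses import dataclass, field
from typing import Callable, List

# Hypothetical interface: any function that maps a prompt to a completion.
LLM = Callable[[str], str]
# Hypothetical retrieval hook: returns jokes related to a humor target.
Retriever = Callable[[str], List[str]]


@dataclass
class HumorContext:
    """Knowledge passed between roles (illustrative structure only)."""
    description: str
    conflicting_scripts: str
    associations: List[str] = field(default_factory=list)


def extract_conflicting_scripts(llm: LLM, description: str) -> HumorContext:
    # Role 1: ground the humor in a script opposition.
    scripts = llm(f"Identify the opposing scripts in this scene: {description}")
    return HumorContext(description=description, conflicting_scripts=scripts)


def imagine(llm: LLM, ctx: HumorContext, retrieve: Retriever) -> HumorContext:
    # Role 2: pick a humor target, then widen it with one LLM association
    # plus whatever the retrieval hook returns.
    target = llm(f"Name the key humor target given: {ctx.conflicting_scripts}")
    ctx.associations = [llm(f"Free-associate on: {target}")] + retrieve(target)
    return ctx


def generate_captions(llm: LLM, ctx: HumorContext, n: int) -> List[str]:
    # Role 3: condition caption generation on everything gathered so far.
    prompt = (
        f"Scene: {ctx.description}\n"
        f"Scripts: {ctx.conflicting_scripts}\n"
        f"Associations: {'; '.join(ctx.associations)}\n"
        "Write one funny caption."
    )
    return [llm(f"{prompt} (variant {i + 1})") for i in range(n)]


def homer_sketch(llm: LLM, description: str, retrieve: Retriever, n: int = 3) -> List[str]:
    ctx = extract_conflicting_scripts(llm, description)
    ctx = imagine(llm, ctx, retrieve)
    return generate_captions(llm, ctx, n)
```

Passing the model in as a plain callable keeps the sketch independent of any particular LLM client, which loosely mirrors the abstract's framing of the roles as separate, composable stages.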

Paper Structure (38 sections, 12 equations, 11 figures, 22 tables, 1 algorithm)

Introduction
HOMER
Conflicting Script Extractor
Hierarchical Imaginator
Caption Generator
Experiments
Reliability of Humor Evaluator
Funny Caption Generation
Ablation Studies
Case Study
Human Evaluation
Harmful Detection
Related Work
Conclusions
HOMER Algorithm
...and 23 more sections

Figures (11)

Figure 1: A comparison of our HOMER with GPT-4o and CLoT models in funny caption generation.
Figure 2: Framework of HOMER with three LLM-based roles: (a) Conflicting script extractor, deriving a detailed situation description and conflicting scripts as the basis of humor generation. (b) Hierarchical imaginator, identifying and enhancing the humor target with multi-view LLM associations and humor-relevance retrieval imagination. (c) Caption generator, generating funny and diverse captions conditioned on the obtained knowledge.
Figure 3: Ablation study of humor-relevance score.
Figure 4: $k$ hyperparameter.
Figure 5: $\delta$ hyperparameter.
...and 6 more figures
