
On the Wings of Imagination: Conflicting Script-based Multi-role Framework for Humor Caption Generation

Wenbo Shang, Yuxi Sun, Jing Ma, Xin Huang

TL;DR

This work addresses the challenge of generating funny captions for cartoons by proposing HOMER, a humor-generation framework grounded in the General Theory of Verbal Humor (GTVH). HOMER coordinates three LLM roles (a conflicting-script extractor, a hierarchical imaginator, and a caption generator) and augments them with a humor-retrieval module, so that humor is grounded in script oppositions and expanded through imaginative associations. Key contributions include a modular, interpretable pipeline, a hierarchical imaginator with local/global views and joke retrieval, and a novel humor-relevance scoring mechanism that balances semantic similarity and conceptual opposition. Empirical results on two New Yorker cartoon datasets show significant improvements over state-of-the-art baselines in automatic metrics ($pass@k$) and human evaluations, with robust performance across different base LLMs and low harmful-content rates. The framework offers a principled, controllable approach to multimodal humor generation with potential for generalization to other humorous domains and modalities.
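The summary above reports $pass@k$ without defining it. As a point of reference only, the sketch below computes the usual unbiased $pass@k$ estimator from code-generation evaluation: the probability that at least one of $k$ candidates, drawn without replacement from $n$ generated candidates of which $c$ were judged successful, is a success. Whether the paper uses exactly this definition, and how it decides that a caption counts as a success, are assumptions here; the function name and arguments are illustrative.

```python
from math import comb


def pass_at_k(n: int, c: int, k: int) -> float:
    """Standard unbiased pass@k estimator (assumed, not taken from the paper).

    n: number of generated candidates (e.g. captions) per cartoon
    c: number of those candidates judged successful
    k: number of candidates drawn without replacement
    """
    if not 0 <= c <= n:
        raise ValueError("c must satisfy 0 <= c <= n")
    if not 1 <= k <= n:
        raise ValueError("k must satisfy 1 <= k <= n")
    # Fewer than k failures exist, so every draw of k contains a success.
    if n - c < k:
        return 1.0
    # 1 - P(all k drawn candidates are failures)
    return 1.0 - comb(n - c, k) / comb(n, k)
```

With 10 candidates of which 3 are judged successful, `pass_at_k(10, 3, 1)` reduces to the plain success rate of 3 in 10, and the value rises toward 1 as `k` grows.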

Abstract

Humor is a common yet intricate form of human language in daily life. Humor generation, especially in multi-modal scenarios, is a challenging task for large language models (LLMs). It is typically framed as funny caption generation for images, which requires visual understanding, humor reasoning, and creative imagination, among other abilities. Existing LLM-based approaches rely on reasoning chains or self-improvement, which suffer from limited creativity and interpretability. To address these bottlenecks, we develop a novel LLM-based humor generation mechanism based on a fundamental humor theory, the General Theory of Verbal Humor (GTVH). To produce funny and script-opposite captions, we introduce HOMER, a humor-theory-driven multi-role LLM collaboration framework augmented with humor retrieval. The framework consists of three LLM-based roles: (1) a conflicting-script extractor that grounds humor in key script oppositions, forming the basis of caption generation; (2) a retrieval-augmented hierarchical imaginator that identifies key humor targets and expands their creative space through diverse associations structured as imagination trees; and (3) a caption generator that produces funny and diverse captions conditioned on the obtained knowledge. Extensive experiments on two New Yorker Cartoon benchmark datasets show that HOMER outperforms state-of-the-art baselines and powerful LLM reasoning strategies on multi-modal humor captioning.
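The three-role flow described in the abstract can be pictured with a minimal sketch. This is not the authors' implementation: the `LLM` callable, the `HumorContext` container, the prompt wording, and the `retrieve` function are all hypothetical, and a flat dictionary with one level of associations stands in for the imagination trees.

```python
from dataclasses import dataclass, field
from typing import Callable, Dict, List

# Hypothetical interface: any function mapping a prompt to a completion.
LLM = Callable[[str], str]


@dataclass
class HumorContext:
    """Knowledge passed between roles (illustrative structure)."""
    description: str
    scripts: List[str]
    associations: Dict[str, List[str]] = field(default_factory=dict)


def _lines(text: str) -> List[str]:
    """Split a completion into non-empty, stripped lines."""
    return [ln.strip() for ln in text.splitlines() if ln.strip()]


def extract_conflicting_scripts(llm: LLM, description: str) -> HumorContext:
    """Role 1: ask the model for the opposing scripts in the scene."""
    reply = llm(f"List two opposing scripts, one per line, for: {description}")
    return HumorContext(description=description, scripts=_lines(reply))


def imagine(llm: LLM, ctx: HumorContext,
            retrieve: Callable[[str], List[str]]) -> HumorContext:
    """Role 2: expand each script with model associations plus retrieved jokes.

    A one-level expansion; the paper describes deeper imagination trees.
    """
    for script in ctx.scripts:
        reply = llm(f"Give associations, one per line, for: {script}")
        ctx.associations[script] = _lines(reply) + retrieve(script)
    return ctx


def generate_captions(llm: LLM, ctx: HumorContext, n: int = 3) -> List[str]:
    """Role 3: produce n captions conditioned on the gathered knowledge."""
    captions = []
    for i in range(n):
        prompt = (f"Scene: {ctx.description}\n"
                  f"Scripts: {ctx.scripts}\n"
                  f"Ideas: {ctx.associations}\n"
                  f"Write funny caption #{i + 1}.")
        captions.append(llm(prompt).strip())
    return captions
```

Keeping each role as a separate function over a shared context mirrors the modular design the abstract describes: any role can be swapped or inspected on its own, which is where the claimed interpretability would come from.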

Paper Structure

This paper contains 38 sections, 12 equations, 11 figures, 22 tables, 1 algorithm.

Figures (11)

  • Figure 1: A comparison of our HOMER with GPT-4o and CLoT models in funny caption generation.
  • Figure 2: Framework of HOMER with three LLM-based roles: (a) Conflicting script extractor, deriving a detailed situation description and conflicting scripts as the basis of humor generation. (b) Hierarchical imaginator, identifying and enhancing the humor target with multi-view LLM associations and humor-relevance retrieval imagination. (c) Caption generator, generating funny and diverse captions conditioned on the obtained knowledge.
  • Figure 3: Ablation study of humor-relevance score.
  • Figure 4: $k$ hyperparameter.
  • Figure 5: $\delta$ hyperparameter.
  • ...and 6 more figures