Text Is Not All You Need: Multimodal Prompting Helps LLMs Understand Humor

Ashwin Baluja

TL;DR

Problem: computational humor understanding remains hard for LLMs because humor often depends on multimodal cues such as phonetic ambiguity, rhythm, and timing. Approach: a training-free multimodal prompting framework that supplies TTS-generated audio alongside the text, using chain-of-thought prompts and aggregation to produce explanations. Findings: across SemEval, Context-Situated Puns, and ExplainTheJoke, adding audio consistently improves explanation quality, and the audio preserves the phonetic ambiguity of pun words. Significance: the method is simple to implement with off-the-shelf TTS and API-based LLMs, offering a practical boost to humor understanding without model fine-tuning.

Abstract

While Large Language Models (LLMs) have demonstrated impressive natural language understanding capabilities across various text-based tasks, understanding humor has remained a persistent challenge. Humor is frequently multimodal, relying on phonetic ambiguity, rhythm and timing to convey meaning. In this study, we explore a simple multimodal prompting approach to humor understanding and explanation. We present an LLM with both the text and the spoken form of a joke, generated using an off-the-shelf text-to-speech (TTS) system. Using multimodal cues improves the explanations of humor compared to textual prompts across all tested datasets.

Paper Structure

This paper contains 23 sections, 6 figures, 5 tables.

Figures (6)

  • Figure 1: Multimodal prompting strategy overview. Left: separate text and audio explanations are generated, then aggregated. Right: combined text and audio are processed together for a single explanation. Aggregation refers to prompting intended to ensure a coherent output that does not represent the multiple input modalities.
  • Figure 2: Multimodal prompting strategy where the LLM has both audio and text passed in at once.
  • Figure 3: Logits from transcribing an audio file containing the word "Where".
  • Figure 4: Logits from transcribing an audio file containing a pun, focusing on the pun word "Weight".
  • Figure 5: Explanation prompt for an LLM, including examples and an example input pun.
  • ...and 1 more figure
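
The two strategies described in the Figure 1 and Figure 2 captions can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the `TTS` and `LLM` callable signatures, the prompt wording, the function names, and the stub backends are all assumptions made for the example, and the sample joke is made up rather than drawn from any of the datasets.

```python
from typing import Callable, Optional

# Hypothetical interfaces (assumptions, not from the paper):
#   TTS: joke text -> synthesized audio bytes
#   LLM: (prompt, optional audio bytes) -> model reply
TTS = Callable[[str], bytes]
LLM = Callable[[str, Optional[bytes]], str]

# Illustrative prompt wording only; the paper's actual prompts are not shown here.
COT_PROMPT = "Explain why this joke is funny. Think step by step."
AGGREGATE_PROMPT = (
    "Merge the explanations below into one coherent explanation. "
    "Do not mention that they came from different inputs."
)


def explain_separate(joke: str, tts: TTS, llm: LLM) -> str:
    """Explain text and audio independently, then aggregate (Figure 1, left)."""
    audio = tts(joke)
    text_expl = llm(f"{COT_PROMPT}\n\nJoke: {joke}", None)  # text-only call
    audio_expl = llm(COT_PROMPT, audio)                      # audio-only call
    merged = f"{AGGREGATE_PROMPT}\n\n1. {text_expl}\n2. {audio_expl}"
    return llm(merged, None)                                 # aggregation call


def explain_combined(joke: str, tts: TTS, llm: LLM) -> str:
    """Pass text and audio together in a single call (Figure 1, right)."""
    return llm(f"{COT_PROMPT}\n\nJoke: {joke}", tts(joke))


# Stub backends so the sketch runs without any external service.
def fake_tts(text: str) -> bytes:
    return text.encode("utf-8")  # stands in for real synthesized audio


def fake_llm(prompt: str, audio: Optional[bytes]) -> str:
    has_audio = "yes" if audio is not None else "no"
    return f"[reply: prompt_chars={len(prompt)}, audio={has_audio}]"


if __name__ == "__main__":
    joke = "I used to be a banker, but I lost interest."  # made-up example
    print(explain_separate(joke, fake_tts, fake_llm))
    print(explain_combined(joke, fake_tts, fake_llm))
```

To try this against real services, `fake_tts` and `fake_llm` would be replaced by thin wrappers around whichever TTS system and audio-capable LLM API are in use. Keeping both behind plain callables leaves the prompting logic independent of any particular provider.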