"A good pun is its own reword": Can Large Language Models Understand Puns?

Zhijun Xu, Siyu Yuan, Lingjie Chen, Deqing Yang

TL;DR

This work systematically probes large language models for pun understanding across three tasks: recognition, explanation, and generation. It introduces novel evaluation methods tailored to in-context learning, including dual-biased prompts, punchline checks, chain-of-thought (CoT) prompts, and an Overlap metric to assess originality. Across eight LLMs and two pun types, the study finds that prompt bias significantly shapes recognition, that explanations struggle with heterographic puns (het-puns) yet can reach human-level quality in some models, and that generation shows a prevalent lazy-pun pattern but can achieve strong results in constrained setups, especially with larger models. The results advance our understanding of pun processing in LLMs and provide robust evaluation frameworks and datasets to guide future research in linguistic humor and creative text generation.

Abstract

Puns play a vital role in academic research due to their distinct structure and clear definition, which aid in the comprehensive analysis of linguistic humor. However, the understanding of puns in large language models (LLMs) has not been thoroughly examined, limiting their use in creative writing and humor creation. In this paper, we leverage three popular tasks, namely pun recognition, explanation, and generation, to systematically evaluate the capabilities of LLMs in pun understanding. In addition to adopting the automated evaluation metrics from prior research, we introduce new evaluation methods and metrics that are better suited to the in-context learning paradigm of LLMs. These new metrics offer a more rigorous assessment of an LLM's ability to understand puns and align more closely with human cognition than previous metrics. Our findings reveal the "lazy pun generation" pattern and identify the primary challenges LLMs encounter in understanding puns.

"A good pun is its own reword": Can Large Language Models Understand Puns?

Paper Structure

This paper contains 49 sections, 5 equations, 8 figures, and 17 tables.

Figures (8)

  • Figure 1: Toy examples of achieving three representative tasks related to pun understanding with LLMs, including pun recognition, explanation and generation. We explore the primary difficulties (e.g., paradoxical response, missing alternative word and lazy pattern) in these tasks.
  • Figure 2: The performance of four selected LLMs in recognizing puns via direct answers and CoT responses. The Acc metric represents the overall accuracy.
  • Figure 3: Results of pairwise comparison for pun explanations.
  • Figure 4: Contextual word incorporation rate of different LLMs in constrained pun generation.
  • Figure 5: Average overlap, success, and strict success of two methods for generating puns.
  • ...and 3 more figures