From Punchlines to Predictions: A Metric to Assess LLM Performance in Identifying Humor in Stand-Up Comedy
Adrianna Romanowski, Pedro H. V. Valois, Kazuhiro Fukui
TL;DR
This work tackles the problem of evaluating LLMs' ability to identify humor in stand-up transcripts, a highly subjective domain. It introduces a modular Humor Detection Metric with three scoring modalities, $score^{fuzzy}$, $score^{embed}$, and $score^{subspace}$, each of which compares model outputs $M$ to ground truth $G$ derived from audience laughter via forced alignment. Across 51 transcripts and multiple models, including ChatGPT, Claude, Gemma, Llama, Phi, and DeepSeek, the study finds that even the best systems score at most 51% on the metric, while human evaluators score about 41%, highlighting the persistent difficulty of computational humor understanding. The work also reveals substantial variability in human–machine agreement and emphasizes the need for multimodal evaluation. Code and data are made available to support reproducibility and further research in humor-aware AI systems.
Abstract
Comedy serves as a profound reflection of the times we live in and is a staple element of human interactions. In light of the widespread adoption of Large Language Models (LLMs), the intersection of humor and AI has become no laughing matter. Advancements in the naturalness of human-computer interaction correlate with improvements in AI systems' abilities to understand humor. In this study, we assess the ability of models to accurately identify humorous quotes in a stand-up comedy transcript. Stand-up comedy's unique comedic narratives make it an ideal dataset for improving the overall naturalness of comedic understanding. We propose a novel humor detection metric designed to evaluate LLMs across various prompts on their capability to extract humorous punchlines. The metric has a modular structure that offers three different scoring methods (fuzzy string matching, sentence embedding, and subspace similarity) to provide an overarching assessment of a model's performance. The models' results are compared against those of human evaluators on the same task. Our metric reveals that, regardless of prompt engineering, the leading models (ChatGPT, Claude, and DeepSeek) achieve scores of at most 51% in humor detection. Notably, this performance surpasses that of humans, who achieve a score of 41%. The analysis of human evaluators and LLMs reveals variability in agreement, highlighting the subjectivity inherent in humor and the complexities involved in extracting humorous quotes from live performance transcripts. Code available at https://github.com/swaggirl9000/humor.
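To make the fuzzy string matching modality more concrete, the following is a minimal sketch of how model-extracted quotes $M$ might be compared against laughter-derived ground truth $G$. It is an illustration only, not the paper's definition of $score^{fuzzy}$: the function names, the 0.8 threshold, the recall-style aggregation, and the toy strings are all assumptions, and Python's standard `difflib` stands in for whatever matcher the authors use.

```python
from difflib import SequenceMatcher


def fuzzy_ratio(a: str, b: str) -> float:
    """Case-insensitive similarity in [0, 1] between two strings."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def fuzzy_score(model_quotes, ground_truth, threshold=0.8):
    """Fraction of ground-truth punchlines matched by at least one model quote.

    Hypothetical aggregation: a ground-truth line counts as found when some
    model quote reaches the (assumed) similarity threshold.
    """
    if not ground_truth:
        return 0.0
    matched = sum(
        1
        for g in ground_truth
        if any(fuzzy_ratio(m, g) >= threshold for m in model_quotes)
    )
    return matched / len(ground_truth)


# Toy strings for illustration; not taken from the paper's dataset.
G = [
    "I told my doctor I broke my arm in two places",
    "He said stop going to those places",
]
M = [
    "he said stop going to those places.",
    "Thanks, you've been great",
]

print(fuzzy_score(M, G))
```

Fuzzy matching of this kind tolerates small differences in casing and punctuation between a model's quoted punchline and the transcript, which is one plausible reason to prefer it over exact string equality.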
