Bridging the Creativity Understanding Gap: Small-Scale Human Alignment Enables Expert-Level Humor Ranking in LLMs

Kuan Lok Zhou, Jiayi Chen, Siddharth Suresh, Reuben Narad, Timothy T. Rogers, Lalit K Jain, Robert D Nowak, Bob Mankoff, Jifan Zhang

TL;DR

This work tackles the gap in humor understanding by decomposing LLM capabilities into visual understanding, humor reasoning, and audience-preference alignment. The authors integrate improved visual annotation, explicit humor explanations, and two alignment strategies, finding that crowd-preference fine-tuning yields the largest gains, achieving 82.4% accuracy on easy caption-pair ranking and approaching human expert performance. Persona-based prompting shows limited value, highlighting fundamental challenges in modeling subgroup preferences for subjective tasks. The results suggest that advancing creative understanding in AI may require extensive, domain-specific human preference data and careful alignment to diverse audiences, with implications for pursuing AGI in creative domains.

Abstract

Large Language Models (LLMs) have shown significant limitations in understanding creative content, as demonstrated by the influential work of Hessel et al. (2023) on the New Yorker Cartoon Caption Contest (NYCCC). Their study exposed a substantial gap between LLMs and humans in humor comprehension, establishing that understanding and evaluating creative content is a key challenge in AI development. We revisit this challenge by decomposing humor understanding into three components and systematically improving each: enhancing visual understanding through improved annotation, utilizing LLM-generated humor reasoning and explanations, and implementing targeted alignment with human preference data. Our refined approach achieves 82.4% accuracy in caption ranking, significantly improving upon the previous 67% benchmark and matching the performance of world-renowned human experts in this domain. Notably, while attempts to mimic subgroup preferences through various persona prompts showed minimal impact, model finetuning with crowd preferences proved remarkably effective. These findings reveal that LLM limitations in creative judgment can be effectively addressed through focused alignment to specific subgroups and individuals. Lastly, we propose the position that achieving artificial general intelligence necessitates systematic collection of human preference data across creative domains. We advocate that just as human creativity is deeply influenced by individual and cultural preferences, training LLMs with diverse human preference data may be essential for developing true creative understanding.
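
For readers unfamiliar with the metric, the caption-ranking accuracy quoted in the abstract can be read as pairwise accuracy: the fraction of caption pairs for which a judge picks the caption that humans ranked higher. The following is a minimal sketch under that assumption, not the authors' evaluation code. The pair format, the function names `pairwise_accuracy` and `longer_caption_judge`, and the toy captions are all hypothetical and were made up for illustration.

```python
from typing import Callable, Sequence, Tuple

# Hypothetical pair format: (caption_a, caption_b, label), where label is 0 if
# humans ranked caption_a higher and 1 if they ranked caption_b higher.
Pair = Tuple[str, str, int]


def pairwise_accuracy(pairs: Sequence[Pair], pick: Callable[[str, str], int]) -> float:
    """Return the fraction of pairs where the judge picks the human-preferred caption."""
    if not pairs:
        raise ValueError("need at least one caption pair")
    correct = sum(1 for a, b, label in pairs if pick(a, b) == label)
    return correct / len(pairs)


def longer_caption_judge(a: str, b: str) -> int:
    """Toy stand-in for a model: always prefers the longer caption."""
    return 0 if len(a) >= len(b) else 1


# Invented toy data, not taken from the NYCCC.
toy_pairs = [
    ("I said sit, not audit.", "Dog.", 0),
    ("Hm.", "He only barks at quarterly reports.", 1),
    ("This meeting could have been a much longer email.", "No.", 1),
]

accuracy = pairwise_accuracy(toy_pairs, longer_caption_judge)
print(f"pairwise accuracy: {accuracy:.3f}")
```

In a real evaluation, `pick` would wrap a call to the model under test, and the pairs would come from held-out contests rather than hand-written examples.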

Paper Structure

This paper contains 21 sections, 5 figures, 5 tables.

Figures (5)

  • Figure 1: Our work improves over the state of the art in caption ranking through a three-stage process. With multimodal LLM assistance, we manually fix visual understanding and cartoon description flaws. Our framework also incorporates o1 reasoning capabilities in explaining a joke, before utilizing two different alignment methods to align an LLM's preferences with the human preferences from the NYCCC. Our experiments demonstrate that we achieve human-expert-level accuracy in this caption ranking task.
  • Figure 2: Composition of cartoon caption contest datasets across Hessel et al. (2023), Zhang et al. (2024), and our paper. In our paper, we examine 20 pairs of captions selected from 379 contests (#510-#889). The dataset is further split into 279 contests for training and 100 for testing.
  • Figure 3: Example voting page for the caption contest.
  • Figure 4: Examples of three types of errors in machine-generated cartoon descriptions and their human-annotated corrections. Left: Minor errors in word choice ("tourists" vs. "clerk", "map" vs. "hotdogs"). Center: Omission of key narrative details (missing the humorous implication of eagles gossiping about another eagle's appearance). Right: Fundamentally incorrect scene interpretation (misidentifying two snakes as a turtle and snake).
  • Figure 5: Comparison of humor explanation quality between GPT-4o and o1-preview, illustrated through two cartoon-caption pairs and their respective AI-generated humor explanations. o1-preview demonstrates a deeper comprehension of the humor, and its explanations are highlighted in bold text.