Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning
Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan Lok Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, Robert Mankoff, Robert Nowak
TL;DR
The paper presents a crowd-sourced dataset of humor ratings from The New Yorker Caption Contest (over 250 million ratings on more than 2.2 million captions), enabling rigorous study of alignment for humorous caption generation in multimodal LLMs. It introduces the HumorAI Benchmark, which uses group-based evaluation (Group Overall and Best Pick) and holds out 91 contests to compare AI-generated captions against top human submissions, with GPT-4 and human judges providing the rankings. Experiments with open-source and closed-source models show that current LLMs lag behind top human captions. The paper also analyzes alignment methods, including supervised fine-tuning (SFT), reinforcement learning from human feedback (RLHF), direct preference optimization (DPO), and best-of-N sampling (BoN): DPO often improves Best Pick performance, while BoN boosts overall win rates but can reduce diversity. The work highlights how difficult humor is as an optimization objective, demonstrates the value of large-scale human preference data, and provides open-source datasets and tools to advance AI humor generation and evaluation, with implications for humor, culture, and AI safety in creative tasks.
Abstract
We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human ratings on more than 2.2 million captions, collected by crowdsourcing ratings for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning algorithms for humorous caption generation. We propose novel benchmarks for judging the quality of model-generated captions, utilizing both GPT-4 and human judgments to establish ranking-based evaluation strategies. Our experimental results highlight the limitations of current fine-tuning methods, such as RLHF and DPO, when applied to creative tasks. Furthermore, we demonstrate that even state-of-the-art models like GPT-4 and Claude currently underperform top human contestants in generating humorous captions. As we conclude this extensive data collection effort, we release the entire preference dataset to the research community, fostering further advancements in AI humor generation and evaluation.
