Humor in AI: Massive Scale Crowd-Sourced Preferences and Benchmarks for Cartoon Captioning

Jifan Zhang, Lalit Jain, Yang Guo, Jiayi Chen, Kuan Lok Zhou, Siddharth Suresh, Andrew Wagenmaker, Scott Sievert, Timothy Rogers, Kevin Jamieson, Robert Mankoff, Robert Nowak

TL;DR

The paper presents a massive, crowd-sourced dataset of humor ratings from The New Yorker Caption Contest, enabling rigorous study of alignment for humorous caption generation in multimodal LLMs. It delivers a new HumorAI Benchmark with group-based evaluation (Group Overall and Best Pick) and holds out 91 contests to compare AI captions against top human submissions, using GPT-4 and human judges for ranking. Through experiments with open-source and closed-source models, it shows current LLMs lag behind top human captions, analyzes alignment methods (SFT, RLHF, DPO, BoN), and observes that DPO often improves Best Pick performance while BoN boosts overall win rates but can reduce diversity. The work highlights the challenges of humor as a guiding objective, demonstrates the value of large-scale human preference data, and provides open-source datasets and tools to advance AI humor generation and evaluation, with implications for humor, culture, and AI safety in creative tasks.

Abstract

We present a novel multimodal preference dataset for creative tasks, consisting of over 250 million human ratings on more than 2.2 million captions, collected through crowdsourcing rating data for The New Yorker's weekly cartoon caption contest over the past eight years. This unique dataset supports the development and evaluation of multimodal large language models and preference-based fine-tuning algorithms for humorous caption generation. We propose novel benchmarks for judging the quality of model-generated captions, utilizing both GPT4 and human judgments to establish ranking-based evaluation strategies. Our experimental results highlight the limitations of current fine-tuning methods, such as RLHF and DPO, when applied to creative tasks. Furthermore, we demonstrate that even state-of-the-art models like GPT4 and Claude currently underperform top human contestants in generating humorous captions. As we conclude this extensive data collection effort, we release the entire preference dataset to the research community, fostering further advancements in AI humor generation and evaluation.

Paper Structure (31 sections, 2 equations, 3 figures, 10 tables, 1 algorithm)

Figures (3)

  • Figure 1: Overview of our workflow. During data collection, a new cartoon is released each week and thousands of captions are submitted. We then collect caption ratings through a crowd-sourcing procedure driven by a bandit algorithm. Our dataset is a collection of 365 contests, over 2.2M captions and over 250M human ratings. This dataset is utilized for our humor generation task and benchmark. We experiment with fine-tuned open-source models and closed-source API calls (both LLMs and MLLMs). Our novel and low-cost evaluator provides better reliability in evaluating captions.
  • Figure 2: Example voting page for contest 895
  • Figure 5: Example caption generations for contest #895 (cartoon in Figure 2)