Predicting Winning Captions for Weekly New Yorker Comics

Stanley Cao, Sonny Young

TL;DR

The study tackles the challenge of generating humorous captions for New Yorker cartoons by evaluating multiple Vision Transformer–based captioning pipelines. It compares a CLIP-GPT2 baseline, LLaVA-NeXT (with zero-shot, few-shot, and chain-of-thought (CoT) prompting, as well as QLoRA finetuning), and GPT-4V on a New Yorker dataset augmented with metadata. Automated overlap metrics (BLEU/ROUGE) track the quality of humor-rich captions poorly, while human judgments (SS-SCORE) favor GPT-4V with few-shot prompts, highlighting the importance of model scale and prompting strategy. The results reveal significant gaps between automated metrics and perceived caption quality, underscoring the need for better evaluation paradigms and prompting techniques to model culturally nuanced humor in multi-modal AI systems.
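As a rough illustration of why n-gram overlap metrics can diverge from perceived caption quality, the sketch below computes clipped n-gram precision, the core ingredient of BLEU. It is not the full metric (no brevity penalty, no geometric mean over n-gram orders) and is not the evaluation code used in the paper. The function names and the three captions are hypothetical, invented for illustration, and not drawn from the dataset.

```python
from collections import Counter


def ngrams(tokens, n):
    """Return the list of n-grams (as tuples) in a token list."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]


def clipped_precision(candidate, reference, n=1):
    """Clipped n-gram precision of a candidate against a single reference.

    Each candidate n-gram is credited at most as many times as it
    occurs in the reference, as in BLEU's modified precision.
    """
    cand = ngrams(candidate.lower().split(), n)
    if not cand:
        return 0.0
    ref_counts = Counter(ngrams(reference.lower().split(), n))
    cand_counts = Counter(cand)
    overlap = sum(min(count, ref_counts[gram])
                  for gram, count in cand_counts.items())
    return overlap / len(cand)


# Invented captions (assumption: illustrative only, not from the dataset).
reference = "i told you the meeting could have been an email"
near_copy = "i told you the meeting could have been an omelet"
paraphrase = "this really should have been a memo"

if __name__ == "__main__":
    for name, caption in [("near_copy", near_copy), ("paraphrase", paraphrase)]:
        score = clipped_precision(caption, reference, n=1)
        print(f"{name}: unigram precision = {score:.3f}")
```

In this toy case the near copy, which swaps out the word carrying the joke, scores far higher than the paraphrase that preserves it. That is one way a surface-overlap metric can disagree with a human reader.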

Abstract

Image captioning using Vision Transformers (ViTs) represents a pivotal convergence of computer vision and natural language processing, offering the potential to enhance user experiences, improve accessibility, and provide textual representations of visual data. This paper explores the application of image captioning techniques to New Yorker cartoons, aiming to generate captions that emulate the wit and humor of winning entries in the New Yorker Cartoon Caption Contest. This task necessitates sophisticated visual and linguistic processing, along with an understanding of cultural nuances and humor. We propose several new baselines for using Vision Transformer encoder-decoder models to generate captions for the New Yorker Cartoon Caption Contest.

Paper Structure (21 sections, 1 equation, 7 figures)

Figures (7)

  • Figure 1: Detailed information, including selected metadata entries, for New Yorker Cartoon Caption Contest #102
  • Figure 2: Modified target label with metadata prepended for the cartoon in Figure 1
  • Figure 3: 0-shot prompt for LLaVA-NeXT and GPT-4V. The <image> token refers to the placement of the image embedding for the LLaVA-NeXT model.
  • Figure 4: 5-shot prompt for LLaVA-NeXT. The <image> token refers to the placement of the image embeddings. The final user input asks the model to generate a caption after it has seen 5 previous examples of human-written winning captions.
  • Figure 5: Chain-of-Thought prompting for both LLaVA-NeXT and GPT-4V.
  • ...and 2 more figures