Dark & Stormy: Modeling Humor in the Worst Sentences Ever Written

Venkata S Govindarajan, Laura Biester

TL;DR

This work introduces a novel corpus of intentionally bad humor drawn from the Bulwer-Lytton (BL) Fiction Contest, coupled with synthetic BL sentences generated by multiple large language models. It shows that BL humor diverges markedly from standard humor datasets, with stronger use of literary devices such as irony, metafiction, and simile, and a prevalence of novel adjective-noun expressions. Humor-detection models trained on conventional data underperform on BL, indicating a domain-specific gap. Synthetic BL sentences imitate the form but exaggerate stylistic features, offering a lens into how prompts shape generation. The study combines data collection, humor-detection evaluation, literary device analysis via a GPT-based feature framework, and surprisal-based incongruity analysis to characterize BL humor and how well it can be replicated synthetically. Data and code are public to enable further research.

Abstract

Textual humor is enormously diverse and computational studies need to account for this range, including intentionally bad humor. In this paper, we curate and analyze a novel corpus of sentences from the Bulwer-Lytton Fiction Contest to better understand "bad" humor in English. Standard humor detection models perform poorly on our corpus, and an analysis of literary devices finds that these sentences combine features common in existing humor datasets (e.g., puns, irony) with metaphor, metafiction and simile. LLMs prompted to synthesize contest-style sentences imitate the form but exaggerate the effect by over-using certain literary devices, and including far more novel adjective-noun bigrams than human writers. Data, code and analysis are available at https://github.com/venkatasg/bulwer-lytton

Paper Structure

This paper contains 25 sections, 8 figures, 2 tables.

Figures (8)

  • Figure 1: An entry from the Bulwer-Lytton contest, with the key literary devices highlighted.
  • Figure 2: Left: humor detected in BL sentences compared with crowd-first/combo-humor, using the combined humor model. Right: humor detected in BL sentences compared with the pun subset of BL/PotD, using the pun model.
  • Figure 3: Relative rank of high-surprisal tokens plotted against their relative sentence position for our 5 datasets.
  • Figure 4: Literary device presence across datasets.
  • Figure 5: Cumulative distribution of percentage of ANs within a dataset against their count in the DCLM corpus.
  • ...and 3 more figures