Multi-Agent Comedy Club: Investigating Community Discussion Effects on LLM Humor Generation

Shiwei Hong, Lingyao Li, Ethan Z. Rong, Chenxinran Shen, Zhicong Lu

TL;DR

This work tests whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion.

Abstract

Prior work has explored multi-turn interaction and feedback for LLM writing, but evaluations still largely center on prompts and localized feedback, leaving persistent public reception in online communities underexamined. We test whether broadcast community discussion improves stand-up comedy writing in a controlled multi-agent sandbox: in the discussion condition, critic and audience threads are recorded, filtered, stored as social memory, and later retrieved to condition subsequent generations, whereas the baseline omits discussion. Across 50 rounds (250 paired monologues) judged by five expert annotators using A/B preference and a 15-item rubric, discussion wins 75.6% of instances and improves Craft/Clarity (Δ = 0.440) and Social Response (Δ = 0.422), with occasional increases in aggressive humor.

Paper Structure (67 sections, 4 equations, 5 figures, 4 tables)

Figures (5)

  • Figure 1: Overview of Multi-Agent Comedy Club. In each round, a host prompts five performer agents to write stand-up comedy monologues. When enabled, a broadcast discussion produces threaded reception (critique, scores, and reactions) that is stored in social memory and retrieved to condition later rounds. We extract paired outputs from the simulation and a baseline model without discussion simulation, and evaluate them with dedicated human annotators via forced A/B preference and multi-dimensional rubric ratings.
  • Figure 2: Workflow overview of our multi-agent sandbox. Left: baseline ($g{=}0$) skips discussion and logs performances only. Right: community discussion ($g{=}1$) adds an iterative discussion loop that produces reception, which is written to social memory at the end of round $t$ and retrieved to condition performers at the start of round $t{+}1$.
  • Figure 3: Visualization of a discussion thread in our setting. A thread groups reception events that are topically and referentially linked, including an initiating post (e.g., a critic review) and subsequent audience posts or free-dialogue replies.
  • Figure 4: Round-to-round dynamics. (a) Round-level mean differences $\Delta=\text{Discussion}-\text{Baseline}$ for Craft/Clarity (Q1--Q6), Social Response (Q12--Q15), and Moral Pressure (Q11). We report Humor Style direction with HarmShift$=\mathrm{mean}(\Delta Q9,\Delta Q10)-\mathrm{mean}(\Delta Q7,\Delta Q8)$ (higher = more harmful shift). (b) The instance-level Q0 majority preference rate for Discussion in each round.
  • Figure 5: Benefit--safety tradeoff. Each point is a paired instance (topic$\times$performer$\times$round). Benefit (x-axis; z-scored, higher is better) averages gains in amusement (Q1), craft (Q2--Q6), downstream impact (Q12--Q15), and centered preference share ($\mathrm{PrefShare}-0.5$). Safety (y-axis; z-scored, higher is better) is the negative mean of moral/value-judgment pressure shift (Q11) and style-direction shift HarmShift. Dashed crosshairs mark dataset means (z=0). Red X marks indicate Pareto-efficient instances; panel (b) highlights the win--win quadrant (Benefit$\ge0$, Safety$\ge0$).
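The quantities named in the captions of Figures 4 and 5 can be illustrated with a minimal Python sketch. It assumes each paired instance is summarized as a dictionary of per-question differences (Discussion minus Baseline) and that Benefit and Safety are both higher-is-better, as the captions state. The function names, the dictionary layout, the use of the population standard deviation for z-scoring, and the toy numbers are all assumptions made for illustration; they are not taken from the paper, and the authors' exact aggregation order may differ.

```python
from statistics import mean, pstdev


def harm_shift(delta):
    """HarmShift = mean(dQ9, dQ10) - mean(dQ7, dQ8); higher = more harmful shift."""
    return mean([delta["Q9"], delta["Q10"]]) - mean([delta["Q7"], delta["Q8"]])


def safety_raw(delta):
    """Un-standardized safety: negative mean of the Q11 shift and HarmShift."""
    return -mean([delta["Q11"], harm_shift(delta)])


def zscore(values):
    """Standardize values to mean 0 (population std; an assumption of this sketch)."""
    mu, sigma = mean(values), pstdev(values)
    return [(v - mu) / sigma for v in values]


def pareto_efficient(points):
    """Flag (benefit, safety) points that no other point dominates.

    A point is dominated if another point is at least as good on both
    axes and strictly better on at least one.
    """
    flags = []
    for i, (b_i, s_i) in enumerate(points):
        dominated = any(
            b_j >= b_i and s_j >= s_i and (b_j > b_i or s_j > s_i)
            for j, (b_j, s_j) in enumerate(points)
            if j != i
        )
        flags.append(not dominated)
    return flags


# Toy per-question differences for one paired instance (illustrative only).
toy_delta = {"Q7": 0.2, "Q8": 0.0, "Q9": 0.5, "Q10": 0.1, "Q11": 0.3}
toy_harm = harm_shift(toy_delta)
toy_safety = safety_raw(toy_delta)

# Toy (benefit, safety) pairs, imagined as already z-scored.
toy_points = [(1.0, 1.0), (0.5, 0.5), (1.5, -0.2), (-1.0, 2.0)]
toy_flags = pareto_efficient(toy_points)

# Win-win quadrant from panel (b): Benefit >= 0 and Safety >= 0.
toy_win_win = [b >= 0 and s >= 0 for b, s in toy_points]
```

In this toy set, the second point is dominated by the first, while the other three each trade benefit against safety and therefore stay on the Pareto front. Pareto efficiency and membership in the win-win quadrant are separate checks: a point can be Pareto-efficient with negative Safety.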