Investigating Wit, Creativity, and Detectability of Large Language Models in Domain-Specific Writing Style Adaptation of Reddit's Showerthoughts
Tolga Buz, Benjamin Frost, Nikola Genchev, Moritz Schneider, Lucie-Aimée Kaffee, Gerard de Melo
TL;DR
This paper investigates how well different LLMs can imitate the short, witty texts of Reddit's Showerthoughts, and whether humans and automated detectors can tell AI-generated texts from human-authored ones. The authors fine-tune GPT-2 and GPT-Neo on Showerthoughts and generate additional samples with GPT-3.5-turbo in a zero-shot setup. They evaluate the outputs in a human survey that rates texts on logical validity, creativity, humor, and cleverness, and that asks whether each text appears to have been written by a real person. They also train RoBERTa-based classifiers to distinguish AI-generated from human-written texts and analyze how individual tokens contribute to the predictions. The results show that AI-generated Showerthoughts reach near-human quality and that humans often cannot reliably detect them, while the RoBERTa detectors perform robustly. This highlights both the potential for convincing short-form creative writing and the need for reliable detection tools in real-world use cases. The work contributes a new Showerthoughts dataset, a multi-model generation and evaluation study, and insights into detection strategies and their limitations, with implications for creative writing, marketing, and content moderation.
Abstract
Recent Large Language Models (LLMs) have shown the ability to generate content that is difficult or impossible to distinguish from human writing. We investigate the ability of differently sized LLMs to replicate human writing style in short, creative texts in the domain of Showerthoughts, thoughts that may occur during mundane activities. We compare GPT-2 and GPT-Neo, both fine-tuned on Reddit data, and GPT-3.5, invoked in a zero-shot manner, against human-authored texts. We measure human preference on the texts across the specific dimensions that account for the quality of creative, witty texts. Additionally, we compare the ability of humans versus fine-tuned RoBERTa classifiers to detect AI-generated texts. We conclude that human evaluators rate the generated texts slightly worse on average regarding their creative quality, but they are unable to reliably distinguish between human-written and AI-generated texts. We further provide a dataset for creative, witty text generation based on Reddit Showerthoughts posts.
