Measuring AI "Slop" in Text
Chantal Shaib, Tuhin Chakrabarty, Diego Garcia-Olano, Byron C. Wallace
TL;DR
This work addresses the lack of a formal, measurable definition of AI-generated "slop". It introduces a taxonomy organized into Information Utility, Information Quality, and Style Quality, and validates it with fine-grained, span-level expert annotations on 150 news articles and 100 retrieval QA passages. The study shows that binary "slop" judgments are somewhat subjective but correlate with latent dimensions such as relevance, coherence, and factuality, and that the axes that matter most differ by domain. Automatic measures capture slop signals only partially (AUPRC ≈ 0.52–0.55; WQRM correlations), and LLMs used as judges or span extractors struggle to replicate human assessments, which highlights the limitations of current automatic metrics and reward models. The taxonomy and annotated data provide a framework for domain-aware evaluation of AI-generated text and can guide future improvements in automatic slop detection and text-quality modeling.
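For readers unfamiliar with the AUPRC figure quoted above, the following is a minimal sketch of how such a score could be computed for a binary slop detector. It uses average precision, a common step-wise estimate of the area under the precision-recall curve. The function name, the toy labels, and the toy scores are all illustrative assumptions; this is not the authors' evaluation code or data.

```python
from typing import Sequence


def average_precision(labels: Sequence[int], scores: Sequence[float]) -> float:
    """Average precision, a common step-wise estimate of AUPRC.

    labels: 1 if annotators marked the item as slop, else 0 (assumed encoding).
    scores: detector scores, higher meaning "more likely slop".
    Ties in scores are broken by input order, which is a simplification.
    """
    if len(labels) != len(scores):
        raise ValueError("labels and scores must have the same length")
    n_pos = sum(labels)
    if n_pos == 0:
        raise ValueError("need at least one positive label")

    # Rank items from highest to lowest detector score.
    ranked = sorted(zip(scores, labels), key=lambda pair: pair[0], reverse=True)

    hits = 0
    precision_sum = 0.0
    for rank, (_, label) in enumerate(ranked, start=1):
        if label == 1:
            hits += 1
            precision_sum += hits / rank  # precision at this recall step
    return precision_sum / n_pos


# Toy data, invented for illustration only (not from the paper).
toy_labels = [1, 0, 1, 0]
toy_scores = [0.9, 0.8, 0.7, 0.1]
toy_ap = average_precision(toy_labels, toy_scores)
toy_baseline = sum(toy_labels) / len(toy_labels)  # chance level = positive rate
print(f"AP = {toy_ap:.3f}, chance baseline = {toy_baseline:.3f}")
```

An AUPRC value is only interpretable relative to the positive rate of the evaluation set, since a random ranker scores roughly that rate. The summary does not state the positive rate, so the 0.52–0.55 range should be read with that caveat in mind.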
Abstract
AI "slop" is an increasingly popular term used to describe low-quality AI-generated text, but there is currently no agreed-upon definition of the term and no means of measuring its occurrence. In this work, we develop a taxonomy of "slop" through interviews with experts in NLP, writing, and philosophy, and propose a set of interpretable dimensions for its assessment in text. Through span-level annotation, we find that binary "slop" judgments are (somewhat) subjective, but that such determinations nonetheless correlate with latent dimensions such as coherence and relevance. Our framework can be used to evaluate AI-generated text in both detection and binary preference tasks, potentially offering new insights into the linguistic and stylistic factors that contribute to quality judgments.
