AI as Humanity's Salieri: Quantifying Linguistic Creativity of Language Models via Systematic Attribution of Machine Text against Web Text

Ximing Lu, Melanie Sclar, Skyler Hallinan, Niloofar Mireshghallah, Jiacheng Liu, Seungju Han, Allyson Ettinger, Liwei Jiang, Khyathi Chandu, Nouha Dziri, Yejin Choi

TL;DR

This paper introduces the Creativity Index, a scalable metric that quantifies linguistic creativity by estimating how much of a text can be reconstructed from vast web snippets, and DJ Search, a dynamic-programming algorithm that efficiently locates verbatim and near-verbatim $n$-grams in a reference corpus. By comparing machine-generated texts from multiple LLMs with human-authored texts across novel writing, poetry, and speeches, the study shows that professional human authors exhibit substantially higher creativity than LLMs, and that RLHF alignment reduces surface-form diversity in model outputs. The work further demonstrates that semantic matching enhances the detected creativity gap and that the Creativity Index can serve as a robust zero-shot detector for machine-generated text, outperforming leading baselines in many domains. Collectively, these findings offer a principled, quantitative lens on AI creativity, reveal the impact of training and alignment on linguistic novelty, and propose a practical tool for distinguishing human from machine text in real-world settings.

Abstract

Creativity has long been considered one of the most difficult aspects of human intelligence for AI to mimic. However, the rise of Large Language Models (LLMs), like ChatGPT, has raised questions about whether AI can match or even surpass human creativity. We present CREATIVITY INDEX as the first step to quantify the linguistic creativity of a text by reconstructing it from existing text snippets on the web. CREATIVITY INDEX is motivated by the hypothesis that the seemingly remarkable creativity of LLMs may be attributable in large part to the creativity of human-written texts on the web. To compute CREATIVITY INDEX efficiently, we introduce DJ SEARCH, a novel dynamic programming algorithm that can search for verbatim and near-verbatim matches of text snippets from a given document against the web. Experiments reveal that the CREATIVITY INDEX of professional human authors is on average 66.2% higher than that of LLMs, and that alignment reduces the CREATIVITY INDEX of LLMs by an average of 30.1%. In addition, we find that distinguished authors like Hemingway exhibit measurably higher CREATIVITY INDEX compared to other human writers. Finally, we demonstrate that CREATIVITY INDEX can be used as a surprisingly effective criterion for zero-shot machine text detection, surpassing the strongest existing zero-shot system, DetectGPT, by a significant margin of 30.2%, and even outperforming the strongest supervised system, GhostBuster, in five out of six domains.

Paper Structure

This paper contains 39 sections, 2 equations, 30 figures, and 4 algorithms.

Figures (30)

  • Figure 1: a: Example outputs from DJ Search. We asked ChatGPT to generate an abstract based on the title of Prof. Michele Elam’s paper, "Poetry Will Not Optimize; or, What Is Literature to AI?" (Elam, 2023). The abstract generated by ChatGPT contains significantly more verbatim and near-verbatim matches with existing texts on the web than the original abstract written by Prof. Elam. b: Definition of Creativity Index. Creativity Index is mathematically equivalent to the area under the $L$-uniqueness curve across a range of minimum $n$-gram lengths $L$. The $L$-uniqueness of ChatGPT is noticeably lower than that of proficient human writers across various context granularities (i.e., $n$-gram lengths) in all domains, leading to a significantly higher Creativity Index for human writers compared to ChatGPT.
  • Figure 2: An illustration of the DJ Search algorithm. A brute-force approach would independently check whether every $n$-gram of $\mathbf{x}$ occurs in $C$, performing a quadratic number of $f$ evaluations with respect to $\mathbf{x}$'s length (i.e., checking every cell in the grid). DJ Search is a two-pointer method that takes only a linear number of $f$ evaluations. By progressively analyzing $n$-grams starting and/or ending at a later endpoint than before, DJ Search limits the total number of $f$ evaluations to $2\|\mathbf{x}\|$. In this example, the minimum $n$-gram length $L$ is set to 5.
  • Figure 3: a-c: Creativity Index in novel writing (a), poetry composition (b) and speech writing (c) based solely on verbatim matches. d: Creativity Index in novel writing considering both verbatim and semantic matches. e: $L$-uniqueness in novel writing with respect to the minimum $n$-gram length $L$ for humans and OLMo. f-g: Creativity Index of LLMs before and after RLHF in novel writing, based solely on verbatim matches (f) and based on both verbatim and semantic matches (g). h: $L$-uniqueness in novel writing with respect to the number of documents in the reference corpus. i: $L$-uniqueness when searching over the top 50 documents in novel writing. j: The number of reference documents required to keep $L$-uniqueness below 50% in novel writing. k-l: Creativity Index of GPT-4 compared to humans in novel writing based on verbatim matches, using a machine-generated reference corpus sourced from the instruction-aligned version of Gemma-7B, Llama3-8B, and Mixtral-7B, as well as a combination of all three. m: Creativity Index of different groups of human writers. n: Detection AUROC across various domains: our approach sets a new state-of-the-art for zero-shot detection, even surpassing supervised baselines.
  • Figure 4: a-c: Creativity Index of ChatGPT in novel writing based on verbatim matches, with different prompt formats (a), $p$ values in top-p decoding (b) and prompt length (c). d: Creativity Index of LLaMA 2 Chat and Tulu 2 with different model sizes.
  • Figure 5: Example outputs from DJ Search based on both verbatim and semantic matches. We prompt LLMs to generate a few paragraphs of a novel, beginning with a first sentence taken from a human-written novel snippet.
  • ...and 25 more figures
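
The two-pointer search in Figure 2 and the area-under-the-curve definition in Figure 1b can be illustrated with a small sketch. This is not the authors' implementation. It assumes exact (verbatim) matching only, a toy in-memory corpus in place of the web, whitespace tokenization, and that $L$-uniqueness is the fraction of words not covered by any matched $n$-gram of length at least $L$. The names `make_exact_matcher`, `covered_mask`, `l_uniqueness`, and `creativity_index` are hypothetical, and the range of $L$ values summed over is a free parameter here rather than the paper's setting.

```python
from typing import Callable, Iterable, List, Sequence

Matcher = Callable[[Sequence[str]], bool]


def make_exact_matcher(corpus: Sequence[str]) -> Matcher:
    """Toy stand-in for the membership test f: exact match against a small corpus."""
    # Pad with spaces so that only whole-token sequences can match.
    padded = [" " + " ".join(doc.split()) + " " for doc in corpus]

    def occurs(ngram: Sequence[str]) -> bool:
        needle = " " + " ".join(ngram) + " "
        return any(needle in doc for doc in padded)

    return occurs


def covered_mask(tokens: Sequence[str], occurs: Matcher, min_len: int) -> List[bool]:
    """Mark tokens that lie inside some matched n-gram of length >= min_len.

    Two-pointer scan: every step advances `start` or `end`, so `occurs` is
    called at most 2 * len(tokens) times. This relies on exact matching being
    monotone (every sub-span of a matched span also matches).
    """
    if min_len < 1:
        raise ValueError("min_len must be at least 1")
    n = len(tokens)
    covered = [False] * n
    start, end = 0, min_len
    while end <= n:
        if occurs(tokens[start:end]):
            for k in range(start, end):
                covered[k] = True
            end += 1  # try to extend the current match to the right
        else:
            start += 1  # give up on this start position
            end = max(end, start + min_len)  # keep the window at least min_len long
    return covered


def l_uniqueness(tokens: Sequence[str], occurs: Matcher, min_len: int) -> float:
    """Fraction of tokens not covered by any matched n-gram of length >= min_len."""
    if not tokens:
        return 0.0  # convention chosen for this sketch only
    covered = covered_mask(tokens, occurs, min_len)
    return covered.count(False) / len(tokens)


def creativity_index(tokens: Sequence[str], occurs: Matcher, lengths: Iterable[int]) -> float:
    """Discrete area under the L-uniqueness curve: sum over the given L values."""
    return sum(l_uniqueness(tokens, occurs, L) for L in lengths)
```

For example, with the single reference document "the quick brown fox jumps over the lazy dog" and the text "a cat saw the quick brown fox jumps today", the five-word span "the quick brown fox jumps" is matched for any minimum length up to 5, so four of the nine words remain unique. Near-verbatim (semantic) matching, which the paper also considers, would need a different `occurs` function and is not covered by this sketch.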