
Artificial Artificial Artificial Intelligence: Crowd Workers Widely Use Large Language Models for Text Production Tasks

Veniamin Veselovsky, Manoel Horta Ribeiro, Robert West

TL;DR

The paper investigates how prevalent LLM use is among crowd workers on Amazon Mechanical Turk (MTurk) for text production tasks, using abstract summarization as a case study. It trains a task-specific synthetic-vs.-real classifier on human-written and ChatGPT-generated summaries and validates the results with keystroke data. The study estimates that 33-46% of crowd workers used LLMs to complete the task; the detector reaches high accuracy (99% at the summary level, 97% at the abstract level), and post-hoc analyses support the estimate. These findings suggest that crowd-produced text data may already be substantially machine-generated, and they call on platforms and researchers to develop task-specific detection strategies and rethink how to ensure that human data remain human.

Abstract

Large language models (LLMs) are remarkable data annotators. They can be used to generate high-fidelity supervised training data, as well as survey and experimental data. With the widespread adoption of LLMs, human gold-standard annotations are key to understanding the capabilities of LLMs and the validity of their results. However, crowdsourcing, an important, inexpensive way to obtain human annotations, may itself be impacted by LLMs, as crowd workers have financial incentives to use LLMs to increase their productivity and income. To investigate this concern, we conducted a case study on the prevalence of LLM usage by crowd workers. We reran an abstract summarization task from the literature on Amazon Mechanical Turk and, through a combination of keystroke detection and synthetic text classification, estimate that 33-46% of crowd workers used LLMs when completing the task. Although generalization to other, less LLM-friendly tasks is unclear, our results call for platforms, researchers, and crowd workers to find new ways to ensure that human data remain human, perhaps using the methodology proposed here as a stepping stone. Code/data: https://github.com/epfl-dlab/GPTurk

Paper Structure (8 sections, 4 figures, 2 tables)


Figures (4)

  • Figure 1: Illustration of our approach for quantifying the prevalence of LLM usage among crowd workers solving a text summarization task. First, we use truly human-written MTurk responses and synthetic LLM-written responses to train a task-specific synthetic-vs.-real classifier. Second, we use this classifier on real MTurk responses (where workers may or may not have relied on LLMs), estimating the prevalence of LLM usage. Additionally (not shown), we confirm the validity of our results in a post-hoc analysis of keystroke data collected alongside MTurk responses.
  • Figure 2: Depiction of the MTurk task studied in this paper, where crowd workers were asked to condense research abstracts from the New England Journal of Medicine into summaries about 100 words long.
  • Figure 3: Proportion of summaries predicted as synthetic depending on the logit threshold.
  • Figure 4: Overlap between summaries and original abstracts (operationalized as ratio of length of longest common substring and length of original abstract), for summaries involving a paste action.
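
The overlap measure in the Figure 4 caption (length of the longest common substring divided by the length of the original abstract) can be sketched as follows. This is a minimal illustration, not the authors' code: the function names are hypothetical, and it assumes the comparison is done on raw characters, whereas the paper may tokenize or normalize the text first.

```python
def longest_common_substring_length(a: str, b: str) -> int:
    """Length of the longest contiguous run of characters shared by a and b.

    Standard dynamic programme: curr[j] holds the length of the common
    substring ending at the current character of a and at b[j - 1].
    """
    best = 0
    prev = [0] * (len(b) + 1)
    for ch_a in a:
        curr = [0] * (len(b) + 1)
        for j, ch_b in enumerate(b, start=1):
            if ch_a == ch_b:
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def overlap_ratio(summary: str, abstract: str) -> float:
    """Longest-common-substring length divided by the abstract length.

    Assumption: character-level comparison with no normalization.
    Returns 0.0 for an empty abstract to avoid division by zero.
    """
    if not abstract:
        return 0.0
    return longest_common_substring_length(summary, abstract) / len(abstract)


if __name__ == "__main__":
    # Toy strings, not data from the study.
    abstract = "the quick brown fox"
    summary = "a quick brown dog"
    print(overlap_ratio(summary, abstract))
```

Under this definition, a summary that copies the whole abstract verbatim scores 1, and a summary that shares no characters with the abstract scores 0.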