Under the Surface: Tracking the Artifactuality of LLM-Generated Data

Debarati Das; Karin De Langis; Anna Martin-Boyle; Jaehyung Kim; Minhwa Lee; Zae Myung Kim; Shirley Anugrah Hayati; Risako Owan; Bin Hu; Ritik Parkar; Ryan Koo; Jonginn Park; Aahan Tyagi; Libby Ferland; Sanjali Roy; Vincent Liu; Dongyeop Kang

TL;DR

The paper investigates the artifacts and biases in data generated by large language models (LLMs) across five data types (Task Labels, Preferences, Instructions, Simulation, Free-Form Text) and benchmarks their quality against human data using first-order (data-level) and second-order (model-level) stress tests. It reveals that while LLM-generated data can reach human-level performance on some tasks, it exhibits systematic biases such as majority dominance, minority underrepresentation, locality biases, role-flipping in simulations, and simplified discourse patterns, which can be amplified during downstream training. By aggregating diverse datasets and applying a comprehensive stress-testing framework, the work highlights practical risks and ethical considerations, offering concrete recommendations for better data generation, evaluation, and documentation. The findings stress the need for human-in-the-loop or hybrid data strategies and transparent data provenance to ensure the reliability and fairness of LLM-based data ecosystems in real-world applications.

Abstract

This work delves into the expanding role of large language models (LLMs) in generating artificial data. LLMs are increasingly employed to create a variety of outputs, including annotations, preferences, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data often intersect in their application, they exert mutual influence on each other and raise significant concerns about the quality and diversity of the artificial data incorporated into training cycles, leading to an artificial data ecosystem. To the best of our knowledge, this is the first study to aggregate various types of LLM-generated text data, from more tightly constrained data like "task labels" to more lightly constrained "free-form text". We then stress test the quality and implications of LLM-generated artificial data, comparing it with human data across various existing benchmarks. Despite artificial data's capability to match human performance, this paper reveals significant hidden disparities, especially in complex tasks where LLMs often miss the nuanced understanding of intrinsic human-generated content. This study critically examines diverse LLM-generated data and emphasizes the need for ethical practices in data creation and when using LLMs. It highlights the LLMs' shortcomings in replicating human traits and behaviors, underscoring the importance of addressing biases and artifacts produced in LLM-generated content for future research and development. All data and code are available on our project page.

Paper Structure (69 sections, 22 figures, 13 tables)

This paper contains 69 sections, 22 figures, 13 tables.

Introduction
Research Focus, Questions, and Scope
Main Findings
Contributions
Structure of Paper
Types of Artificial Data
Types of Stress Testing Methods
Overall Takeaways
Thematic Grouping of Detected Artifacts across LLM-generated Data Types
Artifacts in Task Labels
Related Work
Data
First Order Experiments
Findings on Majority and Minority Representation Comparison
Findings on Variation and Disagreement Analyses
...and 54 more sections

Figures (22)

Figure 1: An example of an artificial data ecosystem in which LLMs are increasingly employed to create a variety of outputs, including annotations, preference labels, instruction prompts, simulated dialogues, and free text. As these forms of LLM-generated data are often intertwined in their application, they exert mutual influence on each other within interconnected use cases. This interdependence raises significant concerns about the quality and diversity of the artificial data incorporated into training cycles. In this LLM ecosystem, there is a risk that AI systems may become predominantly or entirely dependent on artificially generated inputs.
Figure 2: Large language models often falter in unfamiliar scenarios, exhibiting biases and a lack of nuanced understanding of complex human opinions, and thus struggle to accurately replicate human behavior in tasks such as problem-solving. This leads to decreased performance in models trained on LLM-generated data containing biases and artifacts, underscoring the critical need to monitor and address these issues in LLM-generated content.
Figure 3: Overview of the five types of LLM-generated data and associated examples, from the most tightly constrained output (left) to the most lightly constrained output (right): (1) Task Labels, (2) Preference, (3) Instructions, (4) Simulation, and (5) Free-Form Text. Sources for these examples, in order: diaz2018addressing, kim2023p2c, honovich2022unnatural, liang2023encouraging, and moller2023prompt.
Figure 4: Methods to stress test LLM-generated data for the first- and second-order experiments. In summary, the first-order experiments investigate the data "as-is," for example, focusing on their distributional differences and correlation patterns among human- and LLM-generated data; validating and analyzing using manual inspection; and counting how often labels flip between the original human and the resulting machine text. The second-order experiments involve fine-tuning LLMs on the machine-generated data and investigating whether the existing artifacts or biases are amplified.
Figure 5: Model accuracy for majority (M) and minority (m) match comparison on the Sentiment dataset. ChatGPT has the highest majority match accuracy and the lowest minority match accuracy across all datasets, so minority annotations tend to be inadequately represented.
...and 17 more figures
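
The first-order checks named in the captions for Figures 4 and 5 (majority versus minority match, and counting label flips) can be illustrated with a minimal sketch. This is not the authors' code: the function names, the tie-handling rule (items with no unique majority are skipped), and the definition of a "minority match" as agreement with any non-majority human label are all assumptions made for illustration.

```python
from collections import Counter


def majority_minority_labels(annotations):
    """Split one item's human annotations into a majority label and minority labels.

    Assumption: if the top count is tied, the item has no majority and
    (None, set()) is returned.
    """
    counts = Counter(annotations).most_common()
    if len(counts) > 1 and counts[0][1] == counts[1][1]:
        return None, set()
    majority = counts[0][0]
    minority = {label for label, _ in counts[1:]}
    return majority, minority


def match_rates(human_annotations, model_labels):
    """Return (majority_match_rate, minority_match_rate) for model labels.

    The minority rate is computed only over items that have at least one
    minority label (hypothetical convention).
    """
    maj_hits = maj_total = min_hits = min_total = 0
    for anns, pred in zip(human_annotations, model_labels):
        majority, minority = majority_minority_labels(anns)
        if majority is None:
            continue  # skip items without a unique majority
        maj_total += 1
        maj_hits += pred == majority
        if minority:
            min_total += 1
            min_hits += pred in minority
    maj_rate = maj_hits / maj_total if maj_total else 0.0
    min_rate = min_hits / min_total if min_total else 0.0
    return maj_rate, min_rate


def label_flip_rate(original_labels, regenerated_labels):
    """Fraction of items whose label differs between the human original
    and the machine-generated counterpart."""
    pairs = list(zip(original_labels, regenerated_labels))
    return sum(a != b for a, b in pairs) / len(pairs) if pairs else 0.0
```

A high majority match rate combined with a low minority match rate would correspond to the pattern described in the Figure 5 caption, where minority annotations are inadequately represented.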
