Data Generation Using Large Language Models for Text Classification: An Empirical Case Study

Yinheng Li; Rogerio Bonatti; Sara Abdali; Justin Wagle; Kazuhito Koishida

TL;DR

This study empirically evaluates LLM-generated synthetic data for text classification, examining how prompt type, data volume, and bias affect downstream performance. Using GPT-3.5-turbo to generate 1000 synthetic examples per task across six NLP tasks, with RoBERTa as the classifier, the authors compare synthetic-only and augmented training under zero-shot, one-shot, and few-shot prompts, including zero-shot topic prompts. The key findings are that mixing a modest amount of raw data with synthetic data yields reliable gains, that synthetic data is most beneficial in low-resource settings, and that diversity, bias, and prompt strategy strongly influence outcomes. The LLM's own performance on a task does not reliably predict downstream results. The paper offers practical guidance on class-conditioned prompting, domain-aligned topic generation, and iterative prompt refinement to balance cost, quality, and coverage when generating synthetic data for text classification.

Abstract

Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
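
The setup described above can be illustrated with a minimal sketch. It assumes a class-conditioned prompt template and a simple mixing step; the function names `build_prompt` and `mix_datasets` and the prompt wording are hypothetical and are not taken from the paper. The default sizes (1000 synthetic plus 100 raw examples) follow the augmented setting described in the Figure 2 caption. No LLM call is made here; the sketch only builds prompt strings and combines already generated examples.

```python
import random


def build_prompt(task, label, examples=()):
    """Build a class-conditioned generation prompt (hypothetical wording).

    With no examples this is a zero-shot prompt; one example gives a
    one-shot prompt, and several give a few-shot prompt.
    """
    lines = [
        f"Task: {task}",
        f"Write one new example whose label is '{label}'.",
    ]
    for text in examples:
        lines.append(f"Example ({label}): {text}")
    lines.append("New example:")
    return "\n".join(lines)


def mix_datasets(raw, synthetic, n_raw=100, n_synthetic=1000, seed=0):
    """Sample raw and synthetic (text, label) pairs into one shuffled training set."""
    rng = random.Random(seed)
    raw_part = rng.sample(raw, min(n_raw, len(raw)))
    syn_part = rng.sample(synthetic, min(n_synthetic, len(synthetic)))
    mixed = raw_part + syn_part
    rng.shuffle(mixed)
    return mixed
```

Calling `build_prompt("sentiment classification", "positive")` gives a zero-shot prompt, and passing a list of labeled texts as `examples` turns it into a one-shot or few-shot prompt. The output of `mix_datasets` would then be passed to whatever classifier training loop is in use.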

Paper Structure (23 sections, 5 figures, 5 tables)

Introduction
Related Work
Data Augmentation
Large Language Models (LLMs)
Methods
Experiments
Key Findings
Mixing Raw Data is Necessary
Impact of Bias
Relationship between LLM Performance and Data Quality
Synthetic Data is Helpful Mostly in Low-Resource Settings
A Comparison Between Different Prompting Methods
Synthetic Data Diversity and Similarity to Raw Data
Synthetic Data Quantity
Data Generation Techniques in Practice
...and 8 more sections

Figures (5)

Figure 1: Pipeline for Data Augmentation using LLM
Figure 2: Performance of different prompting methods with and without augmentation. Synthetic only: trained on 1000 synthetic examples. Augmented: trained on 1000 synthetic examples plus 100 raw examples
Figure 3: Improvement for different amounts of raw data. raw data (x) uses only x raw data points. augmented (x) uses x raw data points plus 100 synthetic examples. The augmented F1 score is the average model performance over the data generated by 5 different prompting methods
Figure 4: Synthetic Data Similarity
Figure 5: Impact of Synthetic Data Quantity
