Synthetic Data Generation with Large Language Models for Text Classification: Potential and Limitations

Zhuoyan Li; Hangxiao Zhu; Zhuoran Lu; Ming Yin

TL;DR

The paper investigates whether large language models can generate useful synthetic training data for text classification and systematically analyzes how task subjectivity moderates the effectiveness of that data. It compares zero-shot and few-shot synthetic data generation with GPT-3.5-Turbo across ten diverse tasks, training BERT and RoBERTa classifiers and evaluating them with Macro-F1 and accuracy. Models trained on real data outperform those trained on synthetic data, although few-shot guidance improves the effectiveness of synthetic data. The gap to real data is small on low-subjectivity tasks and substantial on highly subjective ones, and the same negative association with subjectivity appears at the instance level. The study highlights the role of data diversity and points to future work on improving synthetic data through human-in-the-loop methods and on exploring a broader range of LLMs, with implications for practitioners deciding whether to rely on synthetic data for a new classification task.
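The two evaluation metrics named above can be illustrated with a minimal, dependency-free sketch. This is not the authors' evaluation code; the function names and the toy labels are assumptions chosen only to show how accuracy and Macro-F1 differ when classes are predicted unevenly.

```python
def accuracy(y_true, y_pred):
    """Fraction of predictions that match the gold labels."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)


def macro_f1(y_true, y_pred):
    """Unweighted mean of per-class F1 over every class seen in gold or predictions."""
    labels = sorted(set(y_true) | set(y_pred))
    f1_scores = []
    for c in labels:
        tp = sum(t == c and p == c for t, p in zip(y_true, y_pred))
        fp = sum(t != c and p == c for t, p in zip(y_true, y_pred))
        fn = sum(t == c and p != c for t, p in zip(y_true, y_pred))
        denom = 2 * tp + fp + fn
        # F1 = 2TP / (2TP + FP + FN); define it as 0 when the class never occurs.
        f1_scores.append(2 * tp / denom if denom else 0.0)
    return sum(f1_scores) / len(f1_scores)


# Hypothetical toy labels, not taken from the paper.
y_true = ["pos", "pos", "neg", "neg"]
y_pred = ["pos", "neg", "neg", "neg"]

acc = accuracy(y_true, y_pred)
mf1 = macro_f1(y_true, y_pred)
print(f"accuracy={acc:.3f} macro_f1={mf1:.3f}")
```

Because Macro-F1 averages per-class scores without weighting by class size, it penalizes a classifier that neglects one class more than accuracy does, which matters when comparing classifiers across tasks with different label distributions.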

Abstract

The collection and curation of high-quality training data is crucial for developing text classification models with superior performance, but it is often associated with significant costs and time investment. Researchers have recently explored using large language models (LLMs) to generate synthetic datasets as an alternative approach. However, the effectiveness of the LLM-generated synthetic data in supporting model training is inconsistent across different classification tasks. To better understand factors that moderate the effectiveness of the LLM-generated synthetic data, in this study, we look into how the performance of models trained on these synthetic data may vary with the subjectivity of classification. Our results indicate that subjectivity, at both the task level and instance level, is negatively associated with the performance of the model trained on synthetic data. We conclude by discussing the implications of our work on the potential and limitations of leveraging LLM for synthetic data generation.

Paper Structure (29 sections, 1 equation, 5 figures, 9 tables)

Introduction
Related Work
Methodology
Zero-shot Synthetic Data Generation
Few-shot Synthetic Data Generation
Evaluation I: Comparison Across Different Types of Tasks
Datasets and Tasks
Task-level Subjectivity Determination
Model Training
Evaluation Results
Exploratory Analysis: Data Diversity
Evaluation II: Comparison Across Different Task Instances
Instance-level Subjectivity Determination
Evaluation Results
Conclusions and Discussions
...and 14 more sections

Figures (5)

Figure 1: Comparing the diversity of the real-world data and the synthetic data.
Figure 2: Changes in the accuracy of the BERT model trained on zero-shot synthetic data as the instance-level annotation agreement threshold varies. The solid blue line in each plot is the linear regression fitted on the data, and the $R$-squared score quantifies the goodness of fit. The Spearman's $\rho$ assesses the strength of rank correlation between the instance-level agreement threshold and the model accuracy for each task. Higher values for both $R$-squared and Spearman's $\rho$, ideally close to $1$, indicate a stronger monotonic relationship between the instance-level subjectivity and the model accuracy.
Figure B.1: The training curves for classification models trained with the real-world data, the zero-shot synthetic data, and the few-shot synthetic data.
Figure B.2: Average top-5 cosine similarity between the real and the synthetic data.
Figure C.1: Changes in the accuracy of the BERT model trained on real-world data as the instance-level annotation agreement threshold varies. The solid blue line in each plot is the linear regression fitted on the data, and the $R$-squared score quantifies the goodness of fit. The Spearman's $\rho$ assesses the strength of rank correlation between the instance-level agreement threshold and the model accuracy for each task. Higher values for both $R$-squared and Spearman's $\rho$, ideally close to $1$, indicate a stronger monotonic relationship between the instance-level subjectivity and the model accuracy.
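
The captions of Figure 2 and Figure C.1 rely on two statistics: the $R$-squared of a linear fit and Spearman's $\rho$. The following sketch shows how both can be computed for one task, assuming a simple least-squares line with an intercept. The threshold and accuracy values are hypothetical placeholders, not results from the paper, and the helper names are illustrative.

```python
def _mean(xs):
    return sum(xs) / len(xs)


def pearson(xs, ys):
    """Pearson correlation coefficient between two equal-length sequences."""
    mx, my = _mean(xs), _mean(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    var_x = sum((x - mx) ** 2 for x in xs)
    var_y = sum((y - my) ** 2 for y in ys)
    return cov / (var_x * var_y) ** 0.5


def r_squared(xs, ys):
    """R^2 of a one-variable least-squares fit with intercept (equals Pearson r squared)."""
    return pearson(xs, ys) ** 2


def ranks(xs):
    """1-based ranks; tied values share the average of their positions."""
    order = sorted(range(len(xs)), key=lambda i: xs[i])
    result = [0.0] * len(xs)
    i = 0
    while i < len(order):
        j = i
        while j + 1 < len(order) and xs[order[j + 1]] == xs[order[i]]:
            j += 1
        avg_rank = (i + j) / 2 + 1
        for k in range(i, j + 1):
            result[order[k]] = avg_rank
        i = j + 1
    return result


def spearman(xs, ys):
    """Spearman's rho: Pearson correlation computed on the ranks."""
    return pearson(ranks(xs), ranks(ys))


# Hypothetical agreement thresholds and model accuracies for one task.
thresholds = [0.5, 0.6, 0.7, 0.8, 0.9]
accuracies = [0.60, 0.66, 0.65, 0.74, 0.80]

r2 = r_squared(thresholds, accuracies)
rho = spearman(thresholds, accuracies)
print(f"R^2={r2:.3f} Spearman rho={rho:.3f}")
```

$R$-squared measures how well a straight line fits, whereas Spearman's $\rho$ depends only on rank order, so a relationship that is monotonic but curved can have a high $\rho$ and a lower $R$-squared.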
