Data Generation Using Large Language Models for Text Classification: An Empirical Case Study
Yinheng Li, Rogerio Bonatti, Sara Abdali, Justin Wagle, Kazuhito Koishida
TL;DR
This study empirically evaluates LLM-generated synthetic data for text classification, examining how prompt type, data volume, and bias affect downstream performance. The authors use GPT-3.5-turbo to generate 1,000 synthetic examples per task across six NLP tasks and train RoBERTa as the classifier. They compare synthetic-only and augmented training under zero-shot, one-shot, and few-shot prompts, including zero-shot topic prompts. Three findings stand out: mixing a modest amount of raw data with synthetic data yields reliable gains; synthetic data is most beneficial in low-resource settings; and diversity, bias, and prompt strategy strongly influence outcomes, while LLM performance does not reliably predict downstream results. The paper offers practical guidance on class-conditioned prompting, domain-aligned topic generation, and iterative prompt refinement to balance cost, quality, and coverage in synthetic data for text classification.
Abstract
Using Large Language Models (LLMs) to generate synthetic data for model training has become increasingly popular in recent years. While LLMs are capable of producing realistic training data, the effectiveness of data generation is influenced by various factors, including the choice of prompt, task complexity, and the quality, quantity, and diversity of the generated data. In this work, we focus exclusively on using synthetic data for text classification tasks. Specifically, we use natural language understanding (NLU) models trained on synthetic data to assess the quality of synthetic data from different generation approaches. This work provides an empirical analysis of the impact of these factors and offers recommendations for better data generation practices.
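To make the setup more concrete, the following is a minimal sketch of two ideas mentioned above: class-conditioned prompts in zero-shot, one-shot, and few-shot form, and augmenting a small amount of raw data with synthetic data. The prompt wording, the function names `build_prompt` and `mix_training_sets`, and the toy sentiment data are all assumptions made for illustration; they are not the authors' prompts or code, and no LLM is actually called.

```python
import random


def build_prompt(task, label, examples=()):
    """Build a class-conditioned generation prompt (hypothetical wording).

    No examples gives a zero-shot prompt, one example a one-shot prompt,
    and several examples a few-shot prompt.
    """
    lines = [
        f"Task: {task}",
        f"Write one new text example whose label is '{label}'.",
    ]
    if examples:
        lines.append("Here are labeled examples for reference:")
        lines += [f"- {text}" for text in examples]
    lines.append("New example:")
    return "\n".join(lines)


def mix_training_sets(raw, synthetic, n_raw, seed=0):
    """Combine a small random sample of raw data with all synthetic data."""
    rng = random.Random(seed)
    sampled = rng.sample(raw, min(n_raw, len(raw)))
    mixed = sampled + list(synthetic)
    rng.shuffle(mixed)
    return mixed


# Toy data, invented for this sketch.
raw = [
    ("great movie", "positive"),
    ("dull plot", "negative"),
    ("loved it", "positive"),
]
synthetic = [
    ("an instant classic", "positive"),
    ("a waste of time", "negative"),
]

task = "sentiment classification of movie reviews"
zero_shot = build_prompt(task, "positive")
few_shot = build_prompt(
    task, "positive", examples=[text for text, y in raw if y == "positive"]
)

# Augmented training set: two raw examples plus every synthetic example.
train = mix_training_sets(raw, synthetic, n_raw=2)
```

In practice, each prompt would be sent to the generation model and the returned text paired with the requested label before training the classifier on `train`.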
