ZeroShotDataAug: Generating and Augmenting Training Data with ChatGPT
Solomon Ubani, Suleyman Olcay Polat, Rodney Nielsen
TL;DR
The paper tackles data scarcity in NLP by zero-shot prompting ChatGPT to generate synthetic training data for low-resource tasks. It benchmarks this approach against established augmentation methods on SST-2, SNIPS, and TREC, and introduces a similarity-based evaluation to assess data quality and contamination. Zero-shot ChatGPT augmentation often outperforms the baselines (notably on SST-2 and TREC) and remains effective with minimal or even no original training data. The study also shows that prompt design significantly influences augmentation quality, and it provides a framework for evaluating synthetic data with multiple similarity metrics. Overall, the approach offers a scalable way to improve NLP model generalization in resource-constrained settings while requiring little annotated data.
Abstract
In this paper, we investigate prompting a large generative language model, ChatGPT, to generate synthetic training data for augmenting data in low-resource scenarios. We show that, with appropriate task-specific ChatGPT prompts, this approach outperforms the most popular existing approaches for such data augmentation. Furthermore, we investigate methodologies for evaluating the similarity of the augmented data generated by ChatGPT, with the aim of validating and assessing the quality of the generated data.
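As a rough illustration of what a similarity-based check on synthetic data could look like, the sketch below scores each generated example by its highest token-level Jaccard overlap with any original example. The choice of Jaccard similarity, the tokenizer, and the function names are assumptions made for illustration; the summary above does not specify which similarity metrics the paper uses.

```python
import re


def tokens(text):
    """Return the set of lowercase word tokens in text (illustrative tokenizer)."""
    return set(re.findall(r"[a-z0-9']+", text.lower()))


def jaccard(a, b):
    """Token-level Jaccard similarity between two strings, in [0, 1]."""
    ta, tb = tokens(a), tokens(b)
    if not ta and not tb:
        return 1.0  # two empty strings are treated as identical
    return len(ta & tb) / len(ta | tb)


def max_similarity(synthetic, originals):
    """For each synthetic example, the highest Jaccard score against any original.

    Scores near 1.0 suggest near-copies of existing data (possible contamination);
    scores near 0.0 suggest lexically novel examples.
    """
    return [
        max((jaccard(s, o) for o in originals), default=0.0)
        for s in synthetic
    ]


if __name__ == "__main__":
    # Hypothetical sentiment-style examples, not taken from the paper's data.
    originals = ["the movie was great", "what a boring film"]
    synthetic = ["the movie was great", "an utterly delightful soundtrack"]
    for text, score in zip(synthetic, max_similarity(synthetic, originals)):
        print(f"{score:.2f}  {text}")
```

A lexical overlap score like this only catches surface-level duplication; an embedding-based metric would be needed to detect paraphrases.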
