Data Augmentation using Large Language Models: Data Perspectives, Learning Paradigms and Challenges
Bosheng Ding, Chengwei Qin, Ruochen Zhao, Tianze Luo, Xinze Li, Guizhen Chen, Wenhan Xia, Junjie Hu, Anh Tuan Luu, Shafiq Joty
TL;DR
This survey analyzes data augmentation (DA) using large language models (LLMs) from the perspectives of data and learning paradigms. It introduces a four-way taxonomy for DA with LLMs (data creation, labeling, reformation, and co-annotation) and outlines methods for generative and discriminative learning within a Teacher-Student framework. It covers techniques such as supervised instruction learning, in-context learning, alignment learning, pseudo data generation, and data scoring, highlighting their applications across NLP and multimodal tasks. It also discusses open challenges (data contamination, controllable augmentation, culture-aware multilinguality, multimodality, and privacy) and offers directions to guide future research and practice.
Abstract
In the rapidly evolving field of large language models (LLMs), data augmentation (DA) has emerged as a pivotal technique for enhancing model performance by diversifying training examples without the need for additional data collection. This survey explores the transformative impact of LLMs on DA, particularly addressing the unique challenges and opportunities they present in the context of natural language processing (NLP) and beyond. From both data and learning perspectives, we examine various strategies that utilize LLMs for data augmentation, including a novel exploration of learning paradigms where LLM-generated data is used for diverse forms of further training. Additionally, this paper highlights the primary open challenges faced in this domain, ranging from controllable data augmentation to multi-modal data augmentation. This survey highlights a paradigm shift introduced by LLMs in DA, and aims to serve as a comprehensive guide for researchers and practitioners.
