Text Data Augmentation for Large Language Models: A Comprehensive Survey of Methods, Challenges, and Opportunities

Yaping Chai, Haoran Xie, Joe S. Qin

TL;DR

This survey addresses data scarcity in training large language models by organizing data augmentation into Simple, Prompt-based, Retrieval-based, and Hybrid techniques. It documents how prompts and external retrieval complement LLM capabilities to generate grounded, diverse training data, and it details post-processing, tasks, and evaluation. The authors discuss granularity and modular categorizations (token to document level) to guide design choices and compare methods across NLP tasks. They also identify challenges such as hallucination, retrieval dependency, cost, and ethical risks, and outline opportunities for more robust and scalable augmentation in real-world LLM deployment.

Abstract

Increasingly large and complex pre-trained language models have demonstrated superior performance in many applications, but they usually require large training datasets to be adequately trained. An insufficient training set can cause the model to overfit and fail to cope with complex tasks. Large language models (LLMs) trained on extensive corpora have strong text generation capabilities, which can improve both the quality and quantity of data and therefore play a crucial role in data augmentation. Specifically, distinctive prompt templates are designed for personalised tasks to guide LLMs in generating the required content. Recent retrieval-based techniques further improve the performance of LLMs in data augmentation by introducing external knowledge, enabling them to produce better-grounded data. This survey provides an in-depth analysis of data augmentation in LLMs, classifying the techniques into Simple Augmentation, Prompt-based Augmentation, Retrieval-based Augmentation and Hybrid Augmentation. We summarise the post-processing approaches in data augmentation, which contribute significantly to refining the augmented data and enable the model to filter out unfaithful content. Then, we present the common tasks and evaluation metrics. Finally, we introduce existing challenges and future opportunities that could bring further improvement to data augmentation.

Paper Structure

This paper contains 58 sections, 4 figures, 4 tables.

Figures (4)

  • Figure 1: Four categories of data augmentation techniques.
  • Figure 2: Recent studies on four categories of data augmentation techniques. As mentioned in Section \ref{sec-DAmethods}, the Hybrid Augmentation technique combines few-shot learning capabilities similar to those of prompt engineering with a retriever that obtains external knowledge. A method that uses only the prompt portion of RAG itself is categorised as a Retrieval-based Augmentation technique.
  • Figure 3: Data augmentation techniques placed on a No Prompt-Basic-Advanced spectrum of Prompt Complexity and a No Retrieval-Basic-Advanced spectrum of Retrieval Model Complexity.
  • Figure 4: Detailed data augmentation methods for the four techniques. Papers shown in grey font belong to Hybrid Augmentation, so the figure also shows how Hybrid Augmentation designs the prompt and performs the retrieval.
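
The captions of Figures 2 and 3 suggest that the four categories can be read off two axes: prompt complexity and retrieval model complexity. The Python sketch below is one possible reading of that scheme, not the authors' definition. The names `Level`, `Category`, and `categorise` are hypothetical, and the rule that any non-zero level on an axis counts as "using" that component is an assumption made for illustration.

```python
from enum import Enum, IntEnum


class Level(IntEnum):
    """Position on one of the two spectra named in the Figure 3 caption."""
    NONE = 0      # "No Prompt" / "No Retrieval"
    BASIC = 1
    ADVANCED = 2


class Category(Enum):
    """The four technique families named in the survey."""
    SIMPLE = "Simple Augmentation"
    PROMPT = "Prompt-based Augmentation"
    RETRIEVAL = "Retrieval-based Augmentation"
    HYBRID = "Hybrid Augmentation"


def categorise(prompt: Level, retrieval: Level) -> Category:
    """Map a (prompt, retrieval) pair to a category.

    Assumption: a method counts as using prompt engineering or a
    retriever whenever its level on that axis is above NONE.
    """
    uses_prompt = prompt > Level.NONE
    uses_retrieval = retrieval > Level.NONE
    if uses_prompt and uses_retrieval:
        return Category.HYBRID
    if uses_retrieval:
        return Category.RETRIEVAL
    if uses_prompt:
        return Category.PROMPT
    return Category.SIMPLE


if __name__ == "__main__":
    for p in Level:
        for r in Level:
            print(f"prompt={p.name:<8} retrieval={r.name:<8} -> {categorise(p, r).value}")
```

Under this reading, a method with a designed prompt and no retriever is Prompt-based, while a method that relies only on the retrieval pipeline's own prompt sits at `Level.NONE` on the prompt axis and is Retrieval-based, which matches the Figure 2 caption.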