Empowering Large Language Models for Textual Data Augmentation

Yichuan Li; Kaize Ding; Jianling Wang; Kyumin Lee

TL;DR

This paper tackles the problem of creating high-quality, task-relevant textual data augmentations with limited manual effort. It introduces Self-LLMDA, a two-stage framework that (1) automatically generates a diverse pool of augmentation instructions via an LLM from a seed set and (2) employs a task-informed scoring model to select the most suitable instruction for a given downstream task, prompting the LLM to produce augmented data accordingly. Across 26 diverse few-shot NLP tasks, Self-LLMDA consistently outperforms non-LLM-based and manually crafted LLM-based augmentation methods, and demonstrates robust generalization to unseen augmentation instructions and other target models. The approach reduces reliance on human-crafted prompts while achieving higher-quality augmented data, with practical implications for scalable, instruction-driven data augmentation in real-world NLP applications. The work also provides comprehensive ablations, hyperparameter analyses, and cross-task transferability assessments to underscore the method’s versatility and limitations.
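The two stages described above can be illustrated with a minimal sketch. This is not the authors' implementation: the function names, prompt wording, `toy_llm`, and `toy_score` are all hypothetical stand-ins. In particular, the word-overlap scorer is only a placeholder for the task-informed scoring model, and the toy LLM is a deterministic stub so the example runs without any API access.

```python
from typing import Callable, List

LLM = Callable[[str], str]          # hypothetical: maps a prompt to a completion
Scorer = Callable[[str, str], float]  # hypothetical: (task description, instruction) -> score


def generate_instruction_pool(llm: LLM, seeds: List[str], rounds: int = 2) -> List[str]:
    """Stage 1: grow a pool of augmentation instructions from a seed set."""
    pool = list(seeds)
    for _ in range(rounds):
        prompt = "Propose one new augmentation instruction unlike these:\n" + "\n".join(pool)
        candidate = llm(prompt).strip()
        if candidate and candidate not in pool:  # keep only novel instructions
            pool.append(candidate)
    return pool


def select_instruction(score: Scorer, task: str, pool: List[str]) -> str:
    """Stage 2: pick the instruction the scorer ranks highest for this task."""
    return max(pool, key=lambda instruction: score(task, instruction))


def augment(llm: LLM, instruction: str, examples: List[str]) -> List[str]:
    """Prompt the LLM with the selected instruction and each task example."""
    return [llm(f"{instruction}\nInput: {x}\nOutput:") for x in examples]


def toy_llm(prompt: str) -> str:
    """Deterministic stub standing in for a real LLM call (assumption)."""
    if prompt.startswith("Propose"):
        return f"Rewrite the text in style #{prompt.count(chr(10))}."
    # Augmentation prompt: echo the input line in upper case.
    return prompt.splitlines()[1][len("Input: "):].upper()


def toy_score(task: str, instruction: str) -> float:
    """Placeholder scorer: word overlap, NOT the paper's learned selection model."""
    def tokens(s: str) -> set:
        return {w.strip(".,").lower() for w in s.split()}
    return float(len(tokens(task) & tokens(instruction)))


seeds = ["Replace words with synonyms.", "Paraphrase the sentence."]
pool = generate_instruction_pool(toy_llm, seeds)
best = select_instruction(toy_score, "paraphrase detection", pool)
augmented = augment(toy_llm, best, ["a cat sat"])
print(pool, best, augmented, sep="\n")
```

In a real setting, `toy_llm` would be replaced by a call to an actual LLM, and `toy_score` by a scoring model trained to predict how useful each instruction is for the downstream task.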

Abstract

With their ability to understand and execute natural language instructions, large language models (LLMs) can potentially act as powerful tools for textual data augmentation. However, the quality of augmented data depends heavily on the augmentation instructions provided, and the effectiveness can fluctuate across different downstream tasks. While manually crafting and selecting instructions can offer some improvement, this approach faces scalability and consistency issues in practice due to the diversity of downstream tasks. In this work, we address these limitations by proposing a new solution that automatically generates a large pool of augmentation instructions and selects the most suitable task-informed instructions, thereby empowering LLMs to create high-quality augmented data for different downstream tasks. Empirically, the proposed approach consistently generates higher-quality augmented data than non-LLM and LLM-based data augmentation methods, leading to the best performance on 26 few-shot learning tasks sourced from a wide range of application domains.

Paper Structure (47 sections, 9 equations, 8 figures, 13 tables)

Introduction
Related Work
Non-LLM Textual Data Augmentation
LLM-based Textual Data Augmentation
Preliminary
Problem Definition.
Manual-LLMDA.
Proposed Approach -- Self-LLMDA
Augmentation Instruction Self-Generation
Task-Informed Instruction Selection
Model Optimization.
Model Inference.
Experiment
Experimental Setup
Evaluation Datasets.
...and 32 more sections

Figures (8)

Figure 1: A simple demo of the pronoun replacement augmentation instruction on a text entailment task, GLUE-MRPC (Wang et al., 2019), and a question answering task, OpenBookQA (Mihaylov et al., 2018).
Figure 2: The pipeline of Self-LLMDA. We first prompt the LLM to generate a diverse set of candidate augmentation instructions (see "Augmentation Instruction Self-Generation"). We then select an instruction (see "Task-Informed Instruction Selection") and apply it, together with the task data, to the LLM to obtain augmentations.
Figure 3: Illustration of the instruction selection scoring model. $F_{\theta}$ is the target model.
Figure 4: Hyperparameter analysis of $n$, which dictates the number of augmentation instructions sampled during the training of the selection model.
Figure 6: Result of generalization to unknown augmentation instruction selection.
...and 3 more figures
