Diversity-oriented Data Augmentation with Large Language Models

Zaitian Wang, Jinghan Zhang, Xinhao Zhang, Kunpeng Liu, Pengfei Wang, Yuanchun Zhou

TL;DR

This paper addresses the insufficient attention given to sample distribution diversity in NLP data augmentation. It proposes DoAug, a framework that uses a diversity-enhanced LLM paraphraser fine-tuned with PEFT and optimized via Direct Preference Optimization, combined with coreset-based selective data augmentation. DoAug preserves high affinity while expanding diversity, achieving an average downstream performance gain of 10.52% across 12 datasets and superior diversity metrics, with robust results across multiple LLM architectures and downstream models. The work highlights practical strategies for balancing diversity and coherence in augmented data and lays groundwork for broader evaluations of diversity in NLP datasets.

Abstract

Data augmentation is an essential technique in natural language processing (NLP) for enriching training datasets by generating diverse samples. This process is crucial for improving the robustness and generalization capabilities of NLP models. However, a significant challenge remains: insufficient attention to sample distribution diversity. Most existing methods focus on increasing the number of samples while neglecting the diversity of the sample distribution, which can lead to model overfitting. In response, we explore data augmentation's impact on dataset diversity and propose a Diversity-oriented data Augmentation framework (DoAug). Specifically, we utilize a diversity-oriented fine-tuning approach to train an LLM as a diverse paraphraser, which is capable of augmenting textual datasets by generating diversified paraphrases. Then, we apply the LLM paraphraser to a selected coreset of highly informative samples and integrate the paraphrases with the original data to create a more diverse augmented dataset. Finally, we conduct extensive experiments on 12 real-world textual datasets. The results show that our fine-tuned LLM augmenter improves diversity while preserving label consistency, thereby enhancing the robustness and performance of downstream tasks. Specifically, it achieves an average performance gain of 10.52%, surpassing the runner-up baseline by more than three percentage points.
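The select-paraphrase-merge flow described in the abstract can be illustrated with a minimal Python sketch. It is not the authors' implementation: the importance scores, the `fraction` and `n_paraphrases` parameters, and the template-based `toy_paraphrase` function (a stand-in for the fine-tuned LLM paraphraser) are all assumptions introduced here for illustration.

```python
from typing import Callable, List, Sequence, Tuple

Sample = Tuple[str, int]  # (text, label)


def select_coreset(
    samples: Sequence[Sample], scores: Sequence[float], fraction: float
) -> List[Sample]:
    """Keep the highest-scoring `fraction` of samples, in original order.

    The scores are a hypothetical per-sample importance measure; how the
    paper actually ranks samples is not specified in the abstract.
    """
    if len(samples) != len(scores):
        raise ValueError("samples and scores must have the same length")
    if not 0.0 < fraction <= 1.0:
        raise ValueError("fraction must be in (0, 1]")
    k = max(1, int(len(samples) * fraction))
    ranked = sorted(range(len(samples)), key=lambda i: scores[i], reverse=True)
    return [samples[i] for i in sorted(ranked[:k])]


def augment(
    samples: Sequence[Sample],
    scores: Sequence[float],
    paraphrase: Callable[[str, int], List[str]],
    fraction: float = 0.5,
    n_paraphrases: int = 2,
) -> List[Sample]:
    """Paraphrase only the coreset, then merge with the original data.

    Each paraphrase inherits the label of its source sample, which assumes
    the paraphraser preserves label consistency.
    """
    augmented = list(samples)
    for text, label in select_coreset(samples, scores, fraction):
        for new_text in paraphrase(text, n_paraphrases):
            if new_text != text:  # skip verbatim copies
                augmented.append((new_text, label))
    return augmented


def toy_paraphrase(text: str, n: int) -> List[str]:
    """Placeholder paraphraser; a real system would call an LLM here."""
    templates = ["In other words, {}", "Put differently, {}", "That is to say, {}"]
    return [t.format(text) for t in templates[:n]]


if __name__ == "__main__":
    data = [("great movie", 1), ("boring plot", 0), ("fine acting", 1), ("bad sound", 0)]
    importance = [0.1, 0.9, 0.4, 0.8]
    for text, label in augment(data, importance, toy_paraphrase):
        print(label, text)
```

In this sketch only the two highest-scoring samples are paraphrased, so the augmented set grows from four to eight samples while every original sample is retained.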

Paper Structure

This paper contains 44 sections, 6 equations, 14 figures, 8 tables, 1 algorithm.

Figures (14)

  • Figure 1: Conceptual comparison of DoAug (right) generating coherent and diverse samples against baselines (left) generating noisy or repetitive samples.
  • Figure 2: An overall framework of DoAug.
  • Figure 3: Diversity, affinity, and performance achieved by DoAug and baseline methods. Results are averaged on 12 datasets and the diversity rankings are further averaged on 6 metrics in this diagram. A smaller number for the rankings indicates better results.
  • Figure 4: Affinity scores of DoAug and 10 baseline methods. The scores are averaged on 12 datasets.
  • Figure 5: Ablation study on diversity gains.
  • ...and 9 more figures