Retrieval-Augmented Data Augmentation for Low-Resource Domain Tasks

Minju Seo, Jinheon Baek, James Thorne, Sung Ju Hwang

TL;DR

The paper tackles data scarcity in domain-specific, low-resource tasks by introducing Retrieval-Augmented Data Augmentation (RADA), a framework that retrieves relevant external samples from a large data store $\mathcal{C}$ and uses LLM prompts to generate new data with both in-context demonstrations and target-context information. By combining seed data $\mathcal{D}$ with retrieved samples, RADA improves the diversity and relevance of augmented data, outperforming strong LLM-powered baselines in both training-time and test-time augmentation scenarios. Analyses show that retrieved data increases diversity (embedding dispersion) while maintaining alignment with seed content, and the approach is robust across different LLMs. The work provides practical gains for real-world low-resource domains and highlights the importance of cross-dataset retrieval for scalable data augmentation under privacy and data-access constraints.

Abstract

Despite large successes of recent language models on diverse tasks, they suffer from severe performance degeneration in low-resource settings with limited training data available. Many existing works tackle this problem by generating synthetic data from the training data and then training models on them, recently using Large Language Models (LLMs). However, in low-resource settings, the amount of seed data samples to use for data augmentation is very small, which makes generated samples suboptimal and less diverse. To tackle this challenge, we propose a novel method that augments training data by incorporating a wealth of examples from other datasets, along with the given training data. Specifically, we first retrieve the relevant instances from other datasets, such as their input-output pairs or contexts, based on their similarities with the given seed data, and then prompt LLMs to generate new samples with the contextual information within and across the original and retrieved samples. This approach can ensure that the generated data is not only relevant but also more diverse than what could be achieved using the limited seed data alone. We validate our proposed Retrieval-Augmented Data Augmentation (RADA) framework on multiple datasets under low-resource settings of training and test-time data augmentation scenarios, on which it outperforms existing LLM-powered data augmentation baselines.
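
The retrieve-then-prompt procedure described in the abstract can be sketched roughly as follows. This is a minimal illustration, not the authors' implementation: the bag-of-words `embed` function stands in for whatever retriever the paper actually uses, the function names (`retrieve`, `build_prompt`), the prompt wording, and the toy seed and data-store entries are all hypothetical, and no LLM is called (the sketch stops at building the prompt string).

```python
import math
import re
from collections import Counter


def embed(text):
    """Toy bag-of-words vector (assumption: a stand-in for a real text encoder)."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def as_text(example):
    """Flatten an example into one string for similarity scoring."""
    return example["context"] + " " + example["question"]


def retrieve(seed, store, k=2):
    """Return the k data-store entries most similar to the seed example."""
    query = embed(as_text(seed))
    ranked = sorted(store, key=lambda ex: cosine(query, embed(as_text(ex))), reverse=True)
    return ranked[:k]


def build_prompt(seed, retrieved):
    """Use the seed and lower-ranked retrieved pairs as demonstrations,
    and the top retrieved context as the target context (hypothetical layout)."""
    target, extra_demos = retrieved[0], retrieved[1:]
    lines = ["Write a new question-answer pair for the final context.", ""]
    for ex in [seed] + extra_demos:
        lines += [f"Context: {ex['context']}",
                  f"Question: {ex['question']}",
                  f"Answer: {ex['answer']}", ""]
    lines += [f"Context: {target['context']}", "Question:"]
    return "\n".join(lines)


# Toy data, invented purely for illustration.
seed = {"context": "The FMLA policy grants eligible employees unpaid leave.",
        "question": "Who can take FMLA leave?",
        "answer": "eligible employees"}
store = [
    {"context": "The film won three awards at the festival.",
     "question": "How many awards did the film win?", "answer": "three"},
    {"context": "Employees may take paternity leave concurrently with FMLA leave.",
     "question": "Can paternity leave run concurrently with FMLA?", "answer": "yes"},
]

retrieved = retrieve(seed, store, k=2)
prompt = build_prompt(seed, retrieved)
```

In this toy run the leave-related entry ranks above the film entry, so its context becomes the target for generation while the seed pair serves as a demonstration. The resulting `prompt` string would then be passed to an LLM of choice.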

Paper Structure (45 sections, 8 figures, 11 tables)

This paper contains 45 sections, 8 figures, 11 tables.

Figures (8)

  • Figure 1: (A) Low-Resource Tasks refer to problems (usually in specific domains) where only a limited amount of data is available. (B) Existing Data Augmentation approaches expand the seed data using only the seed data itself (policy for FMLA), which results in limited diversity of the generated samples (the same FMLA policy). (C) Our Retrieval-Augmented Data Augmentation (RADA) framework generates new data with external context (concurrent usage of FMLA and paternity leave), retrieved from external datasets, along with the seed data, yielding more diverse and useful samples (paternity leave). (Upper Right) RADA outperforms existing data augmentation methods, demonstrating the quality of the generated samples. (Lower Right) The samples generated by RADA are more diverse than those from existing data augmentation methods, based on the t-SNE visualization.
  • Figure 2: RADA Framework Overview. We first retrieve external instances (relevant to the seed data) from the external data store, and then construct the in-context and target-context parts of the LLM prompts from the retrieved samples along with the seed data.
  • Figure 3: Breakdown of retrieved instances on three domain-specific QA datasets, where samples in the retrieval pool come from one of the Biomedical, Computing, Film, Finance, Law, and Music domains, or from NQ (which covers general domains).
  • Figure 4: Embedding-space visualization results of samples including the seed data and augmented data, with t-SNE.
  • Figure 5: Distributions of ROUGE-L scores measured between the seed data and generated data on Tech QA.
  • ...and 3 more figures
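
Figure 5 compares seed and generated data using ROUGE-L, which scores two texts by their longest common subsequence (LCS) of tokens. The sketch below shows one common way to compute it, assuming lowercase whitespace tokenization and an equally weighted F-measure; the exact scoring configuration used in the paper is not stated here, and the function names are illustrative.

```python
def lcs_length(a, b):
    """Length of the longest common subsequence of two token lists."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l_f1(reference, candidate):
    """ROUGE-L F1 between two strings (assumption: whitespace tokens, beta = 1)."""
    ref, cand = reference.lower().split(), candidate.lower().split()
    if not ref or not cand:
        return 0.0
    lcs = lcs_length(ref, cand)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)   # share of candidate tokens in the LCS
    recall = lcs / len(ref)       # share of reference tokens in the LCS
    return 2 * precision * recall / (precision + recall)
```

Under this definition, a generated sample that copies a seed sample verbatim scores 1.0, while one sharing no tokens in order scores 0.0, so a score distribution shifted toward lower values indicates less lexical overlap with the seed data.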