Better Synthetic Data by Retrieving and Transforming Existing Datasets

Saumya Gandhi; Ritu Gala; Vijay Viswanathan; Tongshuang Wu; Graham Neubig

TL;DR

DataTune introduces a two-stage pipeline that repurposes public datasets for target NLP tasks by transforming them into task-aligned formats. It combines LLM-driven dataset retrieval with a modular transformation process (task expansion, schema selection, planning, execution) to generate diverse, challenging training data. Across six BIG-Bench tasks, DataTune outperforms few-shot prompting and existing methods that use synthetic or retrieved data, and it yields further gains when combined with synthetic data. The approach increases data diversity and difficulty without sacrificing accuracy, offering a scalable path for task-specific fine-tuning in low-resource settings.

Abstract

Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
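
To make the four transformation stages named above (task expansion, schema selection, planning, execution) concrete, here is a minimal sketch of how they might chain together over one retrieved dataset. It is not the prompt2model API: the function `transform_dataset`, the `Transformed` type, the `llm` callable, and every prompt string are hypothetical names chosen for illustration, and the dataset retrieval stage is assumed to have already produced `rows`.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

Row = Dict[str, str]
# Hypothetical LLM interface: takes a prompt string, returns generated text.
LLM = Callable[[str], str]


@dataclass
class Transformed:
    source: Row  # the selected columns of the original row
    text: str    # the task-aligned example produced for that row


def transform_dataset(task: str, rows: List[Row], llm: LLM) -> List[Transformed]:
    """Illustrative four-stage transformation of one retrieved dataset."""
    if not rows:
        return []

    # 1. Task expansion: elaborate the short task description.
    expanded = llm(f"EXPAND: {task}")

    # 2. Schema selection: keep only columns the LLM marks as relevant,
    #    ignoring any column names that do not exist in the dataset.
    columns = sorted(rows[0].keys())
    reply = llm(f"SCHEMA: task={expanded} columns={', '.join(columns)}")
    selected = [c.strip() for c in reply.split(",") if c.strip() in columns]

    # 3. Planning: one plan is written per dataset, not per row.
    plan = llm(f"PLAN: task={expanded} columns={', '.join(selected)}")

    # 4. Execution: apply the plan to each row independently.
    results = []
    for row in rows:
        view = {c: row[c] for c in selected}
        text = llm(f"EXECUTE: plan={plan} row={view}")
        results.append(Transformed(source=view, text=text))
    return results
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model provider and makes it easy to substitute a deterministic stub when testing.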

Paper Structure (30 sections, 6 figures, 3 tables)

Introduction
Problem Setup
Methods
Dataset Retrieval
Dataset Transformation
Task Expansion
Schema Selection
Planning Module
Execution Module
Using Multiple Datasets
Experimental Setup
Evaluation procedure
Methods
Dataset Creation Setup
Training Setup
...and 15 more sections

Figures (6)

Figure 1: Obtaining task-specific annotated data can be tricky. Existing solutions include (1) data generation, either by human annotators (incurring high costs) or synthetically with LLMs (risking low diversity), or (2) cross-task transfer, where related but task-misaligned datasets are used (for instance, for the task of generating English language descriptions based on code, this could be a public dataset with coding questions, solutions, and test cases but no explicit descriptions). Our approach combines these strategies by adaptively transforming existing datasets for the target task (using the "solution" field from the public dataset and asking an LLM to create a description or make any required formatting changes), preserving the diversity of the original dataset while ensuring the quality of the synthetically generated data.
Figure 2: The data transformation component of DataTune, explained with an example (in yellow).
Figure 3: We show an example plan for the task of providing concise descriptions of Python code. The retrieved dataset contains natural language questions and code solutions. The plan then specifies that the transformation must create the correct description, create incorrect descriptions to form a multiple-choice dataset, and apply the format changes required to match the target task examples.
Figure 4: Synthetic dataset generation often suffers from the problem of generating multiple duplicates of the same example in a given dataset. On 3 of 5 tasks, we find that data transformation from retrieved datasets significantly mitigates this issue. The other two datasets, Russian and Temporal, represent failure modes of our system. Gold represents the BIG-Bench dataset for a given task.
Figure 5: Dataset Transformation leads to more difficult examples than synthetically generated examples, which are over-represented by easy examples, relative to manually-curated BIG-Bench evaluation datasets (Gold).
...and 1 more figure
