Better Synthetic Data by Retrieving and Transforming Existing Datasets

Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig

TL;DR

DataTune introduces a two-stage pipeline that repurposes public datasets for target NLP tasks by transforming them into task-aligned formats. It combines LLM-driven dataset retrieval with a modular transformation process (task expansion, schema selection, planning, execution) to generate diverse, challenging training data. Across six BIG-Bench tasks, DataTune outperforms few-shot prompting and existing methods that use synthetic or retrieved training data, and it combines synergistically with synthetic data. The approach increases data diversity and difficulty without sacrificing accuracy, offering a scalable path to task-specific fine-tuning in low-resource settings.

Abstract

Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
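The four transformation stages named in the TL;DR (task expansion, schema selection, planning, execution) can be pictured as a chain of LLM calls applied to a retrieved dataset. The following is a minimal sketch under stated assumptions, not the authors' implementation and not the prompt2model API: every function name, the prompt wording, the `input ||| output` reply format, and the `stub_llm` stand-in are hypothetical and exist only to make the control flow concrete.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List

# Hypothetical type: any function that maps a prompt string to a completion string.
LLM = Callable[[str], str]


@dataclass
class Example:
    input: str
    output: str


def expand_task(llm: LLM, instruction: str) -> str:
    """Task expansion: ask the LLM to elaborate the task description."""
    return llm(f"EXPAND:\n{instruction}")


def select_schema(llm: LLM, task_spec: str, columns: List[str]) -> List[str]:
    """Schema selection: keep only the columns the LLM names as relevant."""
    reply = llm(f"SCHEMA:\nTask: {task_spec}\nColumns: {', '.join(columns)}")
    chosen = {name.strip() for name in reply.split(",")}
    return [c for c in columns if c in chosen]


def make_plan(llm: LLM, task_spec: str, columns: List[str]) -> str:
    """Planning: ask for a step-by-step recipe for converting one row."""
    return llm(f"PLAN:\nTask: {task_spec}\nColumns: {', '.join(columns)}")


def execute_plan(llm: LLM, plan: str, row: Dict[str, str], columns: List[str]) -> Example:
    """Execution: apply the plan to a single row of the retrieved dataset."""
    fields = "\n".join(f"{c}: {row[c]}" for c in columns)
    reply = llm(f"EXECUTE:\nPlan: {plan}\nRow:\n{fields}\nReply as: input ||| output")
    new_input, _, new_output = reply.partition("|||")
    return Example(new_input.strip(), new_output.strip())


def transform_dataset(llm: LLM, instruction: str, rows: List[Dict[str, str]]) -> List[Example]:
    """Run the four stages in order over every row."""
    if not rows:
        return []
    task_spec = expand_task(llm, instruction)
    columns = select_schema(llm, task_spec, list(rows[0].keys()))
    plan = make_plan(llm, task_spec, columns)
    return [execute_plan(llm, plan, row, columns) for row in rows]


def stub_llm(prompt: str) -> str:
    """Deterministic stand-in for a real LLM, for demonstration only."""
    if prompt.startswith("EXPAND"):
        return "Write a concise English description of a Python function."
    if prompt.startswith("SCHEMA"):
        return "solution"
    if prompt.startswith("PLAN"):
        return "Describe the code in one sentence."
    # EXECUTE: echo the code back and attach a placeholder description.
    code = prompt.split("solution: ", 1)[1].split("\n", 1)[0]
    return f"{code} ||| Describes: {code}"


# A toy retrieved dataset shaped like the Figure 1 example:
# questions, solutions, and test cases, but no descriptions.
rows = [
    {
        "question": "Add two numbers",
        "solution": "def add(a, b): return a + b",
        "tests": "assert add(1, 2) == 3",
    }
]

examples = transform_dataset(stub_llm, "Describe Python code in English.", rows)
print(examples[0])
```

In a real setting, `stub_llm` would be replaced by a call to an actual model, and the planning output would typically be reviewed or validated before being applied to every row.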

Paper Structure (30 sections, 6 figures, 3 tables)

This paper contains 30 sections, 6 figures, 3 tables.

Figures (6)

  • Figure 1: Obtaining task-specific annotated data can be tricky. Existing solutions include (1) data generation, either by employing human annotators (incurring high costs) or synthetically, such as by using LLMs (risking low diversity), and (2) cross-task transfer, where related but task-misaligned datasets are used (for instance, for the task of generating English-language descriptions of code, this could be a public dataset with coding questions, solutions, and test cases but no explicit descriptions). Our approach combines these strategies by adaptively transforming existing datasets for the target task (using the "solution" field from the public dataset and asking an LLM to create a description or make any required formatting changes), preserving the original dataset's diversity while ensuring the quality of the synthetically generated data.
  • Figure 2: The data transformation component of DataTune, explained with an example (in yellow).
  • Figure 3: We show an example plan for the task of providing concise descriptions of Python code. The retrieved dataset contains natural language questions and code solutions. The plan then specifies that the transformation must create the correct description, create incorrect descriptions to form a multiple-choice dataset, and apply the format changes required to match the target task examples.
  • Figure 4: Synthetic dataset generation often produces multiple duplicates of the same example within a dataset. On 3 of 5 tasks, we find that data transformation from retrieved datasets significantly mitigates this issue. The other two datasets, Russian and Temporal, represent failure modes of our system. Gold represents the BIG-Bench dataset for a given task.
  • Figure 5: Dataset transformation leads to more difficult examples than synthetic generation, whose output over-represents easy examples relative to the manually curated BIG-Bench evaluation datasets (Gold).
  • ...and 1 more figures