Better Synthetic Data by Retrieving and Transforming Existing Datasets
Saumya Gandhi, Ritu Gala, Vijay Viswanathan, Tongshuang Wu, Graham Neubig
TL;DR
DataTune is a two-stage pipeline that repurposes public datasets for a target NLP task by transforming them into a task-aligned format. It combines LLM-driven dataset retrieval with a modular transformation process (task expansion, schema selection, planning, execution) to generate diverse, challenging training data. Across six BIG-Bench tasks, fine-tuning on DataTune data outperforms a few-shot prompting baseline and existing methods that use synthetic or retrieved training data, and it is complementary to synthetic data generation. The approach increases data diversity and difficulty without sacrificing accuracy, offering a scalable path to task-specific fine-tuning in low-resource settings.
Abstract
Despite recent advances in large language models, building dependable and deployable NLP models typically requires abundant, high-quality training data. However, task-specific data is not available for many use cases, and manually curating task-specific data is labor-intensive. Recent work has studied prompt-driven synthetic data generation using large language models, but these generated datasets tend to lack complexity and diversity. To address these limitations, we introduce a method, DataTune, to make better use of existing, publicly available datasets to improve automatic dataset generation. DataTune performs dataset transformation, enabling the repurposing of publicly available datasets into a format that is directly aligned with the specific requirements of target tasks. On a diverse set of language-based tasks from the BIG-Bench benchmark, we find that finetuning language models via DataTune improves over a few-shot prompting baseline by 49% and improves over existing methods that use synthetic or retrieved training data by 34%. We find that dataset transformation significantly increases the diversity and difficulty of generated data on many tasks. We integrate DataTune into an open-source repository to make this method accessible to the community: https://github.com/neulab/prompt2model.
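To make the two-stage structure concrete, the following is a minimal sketch of how retrieval and the four transformation steps (task expansion, schema selection, planning, execution) could be chained. It is not the prompt2model API: every name here (`Candidate`, `retrieve`, `transform`, `overlap_score`, and the lambda stand-ins) is hypothetical, and the toy functions take the place of whatever model calls a real system would make at each step.

```python
from dataclasses import dataclass
from typing import Callable, Dict, List, Optional

Row = Dict[str, str]


@dataclass
class Candidate:
    """A hypothetical record for one public dataset."""
    name: str
    description: str
    rows: List[Row]


def retrieve(task: str, candidates: List[Candidate],
             score: Callable[[str, Candidate], float]) -> Candidate:
    """Stage 1: return the candidate the scorer rates most relevant."""
    return max(candidates, key=lambda c: score(task, c))


def transform(task: str, dataset: Candidate,
              expand: Callable[[str], str],
              select_schema: Callable[[str, Candidate], List[str]],
              plan: Callable[[str, List[str]], List[str]],
              execute: Callable[[List[str], Row], Optional[Row]]) -> List[Row]:
    """Stage 2: task expansion, schema selection, planning, execution."""
    expanded = expand(task)                     # elaborate the task description
    columns = select_schema(expanded, dataset)  # keep only the useful columns
    steps = plan(expanded, columns)             # decide how rows are rewritten
    examples = []
    for row in dataset.rows:
        kept = {c: row[c] for c in columns}
        new_row = execute(steps, kept)          # apply the plan to one row
        if new_row is not None:                 # None means "drop this row"
            examples.append(new_row)
    return examples


def overlap_score(task: str, c: Candidate) -> float:
    """Toy relevance score (assumption): shared words with the description."""
    return len(set(task.lower().split()) & set(c.description.lower().split()))


candidates = [
    Candidate("reviews", "movie reviews with sentiment labels",
              [{"text": "great film", "label": "pos", "id": "1"}]),
    Candidate("qa", "trivia questions with short answers",
              [{"question": "Capital of France?", "answer": "Paris", "id": "7"}]),
]
task = "answer trivia questions"

chosen = retrieve(task, candidates, overlap_score)
examples = transform(
    task, chosen,
    expand=lambda t: t + " (respond with a short phrase)",
    select_schema=lambda t, d: ["question", "answer"],
    plan=lambda t, cols: ["use question as input", "use answer as output"],
    execute=lambda steps, row: {"input": row["question"], "output": row["answer"]},
)
print(chosen.name, examples)
```

Passing each step in as a callable keeps the pipeline modular: any one stage can be replaced, for example with a model-backed implementation, without touching the others.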
