CRAFT Your Dataset: Task-Specific Synthetic Dataset Generation Through Corpus Retrieval and Augmentation

Ingo Ziegler, Abdullatif Köksal, Desmond Elliott, Hinrich Schütze

TL;DR

CRAFT tackles the challenge of producing large-scale, task-specific fine-tuning data from minimal human input by retrieving relevant web documents and augmenting them into custom task formats with an instruction-tuned LLM. The approach combines an on-site, diverse embedding corpus with retrieval-guided synthesis to generate synthetic samples across biology, medicine, commonsense QA, and summarization, achieving performance that rivals or surpasses instruction-tuned baselines and human-curated data in several settings. Across extensive experiments, CRAFT demonstrates robust data scaling, strong generalization to out-of-domain tasks, and greater stability compared with fully synthetic methods, though it exhibits limitations in recipe-generation scaling that motivate future quality-control mechanisms. The work provides a scalable, task-agnostic pipeline that reduces manual data curation while enabling domain-specific fine-tuning for diverse downstream tasks.

Abstract

Building high-quality datasets for specialized tasks is a time-consuming and resource-intensive process that often requires specialized domain knowledge. We propose Corpus Retrieval and Augmentation for Fine-Tuning (CRAFT), a method for generating synthetic datasets, given a small number of user-written few-shots that demonstrate the task to be performed. Given these examples, CRAFT uses large-scale public web-crawled corpora and similarity-based document retrieval to find other relevant human-written documents. Lastly, instruction-tuned large language models (LLMs) augment the retrieved documents into custom-formatted task samples, which then can be used for fine-tuning. We demonstrate that CRAFT can efficiently generate large-scale task-specific training datasets for four diverse tasks: biology, medicine, and commonsense question-answering (QA), as well as summarization. Our experiments show that CRAFT-based models outperform or match general LLMs on QA tasks, while exceeding models trained on human-curated summarization data by 46 preference points. CRAFT outperforms other synthetic dataset generation methods such as Self- and Evol-Instruct, and remains robust even when the quality of the initial few-shots varies.
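The similarity-based retrieval step mentioned in the abstract can be illustrated with a minimal sketch. It assumes documents and few-shots have already been embedded as fixed-length vectors; the toy three-dimensional vectors, the function names `cosine_similarity` and `retrieve_top_k`, and the brute-force search are illustrative assumptions, not the authors' implementation.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # treat zero vectors as dissimilar to everything
    return dot / (norm_a * norm_b)


def retrieve_top_k(few_shot_vec, corpus_vecs, k):
    """Return indices of the k corpus vectors most similar to the few-shot vector."""
    scored = [
        (cosine_similarity(few_shot_vec, vec), idx)
        for idx, vec in enumerate(corpus_vecs)
    ]
    # Highest similarity first; the sort is stable, so ties keep corpus order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [idx for _, idx in scored[:k]]


# Hypothetical toy embeddings (real embeddings would come from an encoder model).
few_shot_vec = [1.0, 0.0, 0.0]
corpus_vecs = [
    [0.9, 0.1, 0.0],   # nearly parallel to the few-shot
    [0.0, 1.0, 0.0],   # orthogonal
    [0.7, 0.7, 0.0],   # partially aligned
    [-1.0, 0.0, 0.0],  # opposite direction
]

top = retrieve_top_k(few_shot_vec, corpus_vecs, k=2)
print(top)
```

At web-corpus scale a brute-force scan like this would be impractical, and an approximate nearest-neighbor index would likely take its place. The ranking criterion stays the same.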

Paper Structure (46 sections, 6 figures, 4 tables)

This paper contains 46 sections, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Synthetic dataset generation process. Given a small set of task-specific few-shots, we retrieve the top-k most similar free-text documents from an embedding database. Each document is then integrated into a task template alongside original few-shots and an instruction prompt. An instruction-tuned LLM generates new synthetic task samples by augmenting the content of the corpus samples to mimic the style of the few-shots. The transformation process for each numbered step is illustrated with example documents in Figure 2.
  • Figure 2: Step-by-step synthetic task sample generation process for BioQA. The color coding indicates where each section is reused throughout the process. For readability, we shorten text sections in this figure, indicated by "[…]". Few-shot design: the layout of a user-written few-shot sample that is used to guide the retrieval and task sample creation process. Corpus sample: a free-text document retrieved from the embedding database based on cosine similarity to the user-written few-shot. Few-shot task template: the prompting template that is used to augment the retrieved corpus sample into a synthetic task sample by using multiple few-shots as in-context examples. Synthetic task sample: an actual synthetic task sample generated from the corpus sample using the few-shot task template.
  • Figure 3: Performance scaling with increasing data size across multiple tasks using CRAFT with 32 few-shot examples. Graphs demonstrate consistent improvements as training data grows from few-shot (XS) to 25,000 synthetic samples (XL). CRAFT models consistently match or exceed Instruct performance (dotted red line). Shaded regions indicate standard deviation across three runs.
  • Figure 4: Performance comparison when CRAFT's retrieval process is initiated with standard human-curated (CRAFT), in-domain (-ID) and purely synthetic (-Synth) few-shots. As the dataset size increases, performance converges across the different few-shot sources, indicating that the retrieval and augmentation framework of CRAFT effectively abstracts away the variability in the quality of the initial few-shots.
  • Figure 5: Performance of CRAFT versus Evol-Instruct and Self-Instruct across tasks and dataset sizes with 8 few-shots. CRAFT shows better scaling and higher accuracy than both baselines in most settings.
  • ...and 1 more figure
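The template step described in the Figure 1 and Figure 2 captions (instruction prompt, few-shots as in-context examples, then the retrieved corpus sample) can be sketched roughly as below. The field names (`text`, `question`, `options`, `answer`), the prompt layout, and the function names are hypothetical choices for illustration; the paper's actual template may differ.

```python
def format_few_shot(sample):
    """Render one few-shot as a text block (hypothetical layout)."""
    options = "\n".join(
        f"{label}. {text}" for label, text in sample["options"].items()
    )
    return (
        f"Text: {sample['text']}\n"
        f"Question: {sample['question']}\n"
        f"{options}\n"
        f"Answer: {sample['answer']}"
    )


def build_prompt(instruction, few_shots, corpus_document):
    """Combine instruction, in-context few-shots, and one retrieved document."""
    parts = [instruction]
    parts.extend(format_few_shot(sample) for sample in few_shots)
    # The prompt ends with an open slot so the LLM continues with a new
    # task sample grounded in the retrieved document.
    parts.append(f"Text: {corpus_document}\nQuestion:")
    return "\n\n".join(parts)


# Hypothetical example data, not taken from the paper.
few_shots = [
    {
        "text": "Mitochondria produce most of a cell's ATP.",
        "question": "Which organelle produces most of a cell's ATP?",
        "options": {"A": "Ribosome", "B": "Mitochondrion"},
        "answer": "B",
    },
]
instruction = "Write a multiple-choice question about the final text."
corpus_document = "Chloroplasts capture light energy during photosynthesis."

prompt = build_prompt(instruction, few_shots, corpus_document)
print(prompt)
```

The resulting string would be sent to an instruction-tuned LLM, whose completion becomes one synthetic task sample.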