MetaSynth: Meta-Prompting-Driven Agentic Scaffolds for Diverse Synthetic Data Generation

Haris Riaz, Sourav Bhabesh, Vinayak Arannil, Miguel Ballesteros, Graham Horwood

TL;DR

Data scarcity and limited diversity in synthetic data hinder scalable LLM adaptation. MetaSynth introduces a memory-augmented meta-LM that orchestrates multiple expert agents to generate highly diverse synthetic documents and instructions, enabling efficient continual pre-training (CPT) for domain adaptation. Across Finance and Biomedicine, MetaSynth with 25M tokens improves Mistral-7B-v0.3 by up to 13.75% in Biomedicine and 4.08% in Finance while preserving general capabilities, with diversity metrics approaching pre-training corpora. The work also presents MetaSynth-Instruct for evolving instructions from synthetic documents and provides multi-faceted diversity assessments, highlighting both practical benefits and challenges like runtime costs and potential domain biases.

Abstract

Recent smaller language models such as Phi-3.5 and Phi-4 rely on synthetic data generated using larger language models. Questions remain about leveraging synthetic data for other use cases, such as adapting LLMs to specific domains. A key limitation of synthetic data is low diversity, which negatively impacts its downstream applicability for improving other models. To address this, we propose MetaSynth, a method for generating synthetic data that enhances diversity through meta-prompting, where a language model orchestrates multiple "expert" LLM agents to collaboratively generate data. Using only 25 million tokens of synthetic data generated with MetaSynth, we successfully adapt a well-trained LLM (Mistral-7B-v0.3) to two specialized domains, Finance and Biomedicine, without compromising the capabilities of the resulting model in general tasks. In addition, we evaluate the diversity of our synthetic data using seven automated metrics, and find that it approaches the diversity of LLM pre-training corpora. Continually pre-training Mistral-7B-v0.3 with MetaSynth notably outperforms the base LLM, showing improvements of up to 4.08% in Finance and 13.75% in Biomedicine. The same model shows degraded performance when trained on data generated using a template prompt, even when the template includes prior generations and varying in-context exemplars of real data. Our findings suggest that a few million tokens of diverse synthetic data, without mixing in any real data, are sufficient for effective domain adaptation when using MetaSynth.

Paper Structure

This paper contains 55 sections, 4 equations, 20 figures, 10 tables, 2 algorithms.

Figures (20)

  • Figure 1: Demonstration of an example MetaSynth agentic workflow for synthesizing a financial document. A meta-LM orchestrates various expert agents that iteratively refine and generate diverse documents conditioned on an initial set of seed documents and previously synthesized documents. Refer to Section \ref{meta-prompting-execution} for a detailed description of the workflow.
  • Figure 2: Metrics are annotated with $\uparrow$ or $\downarrow$ arrows, which indicate whether higher or lower values are better, respectively. 1-GD refers to 1-gram diversity and 4-GD refers to 4-gram diversity. MIF refers to the Mean Inverse Frequency metric (refer to Section \ref{diversity-section}). For a particular domain, diversity metrics for synthetic data generated by template-prompting the base LLM are underlined as reference points. We include diversity metrics over a subset of Wikipedia as a generic example of a dataset regarded as diverse. For each synthetic data generation method and each metric, percentage increases in diversity relative to template prompting are shown in parentheses. Improvements in measured diversity are highlighted in green and reductions in diversity are highlighted in red. All metrics are mean values of 95% CI computed with bootstrap resampling (refer to Appendix \ref{sec:boostrap-resampling}). We control for length in all diversity comparisons by constraining synthetic documents to 400 words (Section \ref{document-gen-seed-document-approach}) and sampling from a similar-length distribution for other sources (e.g., Common Crawl, Wikipedia; Appendix \ref{sec:appendixD}).
  • Figure 3: Comparing the performance of BERT finetuned on data synthesized with template-prompting and MetaSynth versus real data on: (Left) FiQA-SA; (Middle) FPB; (Right) Headlines.
  • Figure 4: Finance Domain: Distribution of diversity metrics for documents synthesized by MetaSynth versus other types of documents (e.g., those generated with template-prompting or real data).
  • Figure 5: Biomedical Domain: Distribution of diversity metrics for documents synthesized by MetaSynth versus other types of documents (e.g., those generated with template-prompting or real data).
  • ...and 15 more figures
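
The Figure 2 caption refers to n-gram diversity (1-GD, 4-GD) and to 95% CIs from bootstrap resampling. As a rough illustration only, the sketch below assumes the common "distinct-n" definition (unique n-grams divided by total n-grams over whitespace tokens) and a percentile bootstrap over per-document scores. The paper's exact definitions, tokenization, and resampling settings are not given in this section, so the function names `ngram_diversity` and `bootstrap_mean_ci` and all parameters here are hypothetical.

```python
import random
from statistics import mean


def ngram_diversity(text, n):
    """Distinct-n ratio: unique n-grams / total n-grams.

    Assumption: lowercased whitespace tokenization; the paper may use
    a different tokenizer or normalization.
    """
    tokens = text.lower().split()
    ngrams = [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]
    if not ngrams:
        return 0.0  # text shorter than n tokens
    return len(set(ngrams)) / len(ngrams)


def bootstrap_mean_ci(values, n_resamples=1000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for the mean of per-document scores.

    Assumption: resampling documents with replacement; returns
    (point_estimate, lower_bound, upper_bound).
    """
    rng = random.Random(seed)
    resampled_means = sorted(
        mean(rng.choices(values, k=len(values))) for _ in range(n_resamples)
    )
    lower = resampled_means[int((alpha / 2) * n_resamples)]
    upper_index = min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))
    upper = resampled_means[upper_index]
    return mean(values), lower, upper


# Toy usage on made-up documents (not data from the paper).
docs = [
    "the bond yield rose as the bond price fell",
    "quarterly earnings beat analyst expectations across all segments",
    "the the the the market market closed closed",
]
scores_1gd = [ngram_diversity(d, 1) for d in docs]
point, lower, upper = bootstrap_mean_ci(scores_1gd)
```

Under these assumptions, a higher ratio means fewer repeated n-grams, which matches the caption's note that higher values are better for 1-GD and 4-GD. Because the ratio tends to fall as documents get longer, comparing it across sources only makes sense at similar lengths, which is the reason the caption gives for constraining synthetic documents to 400 words.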