Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation

Yifang Chen, David Zhu, Simon Du, Kevin Jamieson, Yang Liu

TL;DR

This work treats data synthesis for LLMs as a distinct training task and introduces NOMAD, a paradigm that emphasizes no-prompt-masked training and careful selection of the training set size to improve synthetic data quality. Across small and moderate data regimes, NOMAD achieves notable gains on evaluation tasks such as TriviaQA and GSM8K, especially when it combines a no-prompt-masked synthesis model with filtered data. These results suggest that traditional prompt-masked SFT can be detrimental when the goal is data generation. A key contribution is the analysis of synthetic data quality via the NormSim similarity metric and the framing of data quality in terms of relevance and novelty: median similarity to the target distribution yields the best performance, and larger synthetic datasets can harm novelty. The findings offer practical guidance for constructing synthetic data pipelines and interpreting synthetic data, with implications for data efficiency in LLM instruction tuning and for future data-centric training strategies.

Abstract

Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data. Many recent works explore synthetic data generation using LLMs. However, they primarily focus on prompt engineering with standard supervised instruction-finetuned models, an approach with a fundamental limitation: these models are optimized for general question-answering/problem-solving rather than data generation. We propose a paradigm shift named NOMAD by investigating how to specifically train models for data generation, demonstrating that this task differs significantly from training a classical LM. We identify two key factors: no-prompt-masked training and proper training set size selection. Our method, NOMAD, shows substantial improvements over baselines, achieving >4% gains in TriviaQA and >2% in GSM8K with limited training data. Finally, we offer new insights by interpreting synthetic data through the lenses of "relevance" and "novelty".
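To make the distinction between prompt-masked and no-prompt-masked training concrete, here is a minimal sketch of how per-token labels might be built in each case. It is an illustration, not the authors' implementation: the helper names `build_labels` and `num_supervised_tokens` and the toy token ids are hypothetical, and the value -100 is assumed only because it is the common "ignore this position" label in PyTorch-style cross-entropy losses.

```python
# Assumption: -100 marks positions that the loss function skips,
# following the usual PyTorch cross-entropy ignore_index convention.
IGNORE_INDEX = -100


def build_labels(prompt_ids, response_ids, mask_prompt):
    """Return (input_ids, labels) for one prompt/response pair.

    mask_prompt=True  -> classical SFT: loss on response tokens only.
    mask_prompt=False -> no-prompt-masked: loss on prompt and response tokens.
    """
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        labels = list(input_ids)
    return input_ids, labels


def num_supervised_tokens(labels):
    """Count the positions that actually contribute to the loss."""
    return sum(1 for token in labels if token != IGNORE_INDEX)


# Toy token ids, purely illustrative.
prompt = [11, 12, 13]
response = [21, 22]

_, masked_labels = build_labels(prompt, response, mask_prompt=True)
_, unmasked_labels = build_labels(prompt, response, mask_prompt=False)

print(masked_labels, num_supervised_tokens(masked_labels))
print(unmasked_labels, num_supervised_tokens(unmasked_labels))
```

In the masked case the model is never trained to produce prompt tokens, which is one plausible reading of why a prompt-masked model would be a weaker generator of new prompts.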

Paper Structure

This paper contains 46 sections, 1 equation, 2 figures, 4 tables.

Figures (2)

  • Figure 1: Similarity curves for prompts (left) and responses (right). The y-axis represents the proportion of $X_\text{synthesis}$ above a certain similarity threshold. For prompts, masked training results show significantly lower similarity to the original TULU compared to unmasked training. Among unmasked cases, using the full 300K dataset for synthetic model training yields the highest similarity to original TULU. Response similarity shows smaller gaps across training methods, which is expected as both approaches compute loss on responses.
  • Figure 2: Training $M_{s}$ on $X_\text{synthesis}$ alone vs. on a mixture. We study the correlation between training $M_{s}$ on $X_\text{synthesis}$ alone (x-axis) and training it on the mixture $X_\text{synthesis}$ + $X_\text{train}$ (y-axis), on the two most sensitive metrics: gsm8k (top) and bbh-nocot-fs (bottom). The results cover several settings: 15K or 300K $X_\text{train}$, with masked or unmasked training.
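
The y-axis described in the Figure 1 caption (the proportion of $X_\text{synthesis}$ above a similarity threshold) can be sketched as follows. This is only an illustration under stated assumptions: the function name `proportion_above` is hypothetical, the scores are made-up numbers rather than results from the paper, and the similarity values are assumed to be precomputed per example (how the paper computes them, for example via NormSim, is not reproduced here).

```python
def proportion_above(similarities, thresholds):
    """For each threshold, return the fraction of scores strictly above it."""
    n = len(similarities)
    if n == 0:
        raise ValueError("similarities must not be empty")
    return [
        sum(1 for score in similarities if score > threshold) / n
        for threshold in thresholds
    ]


# Made-up per-example similarity scores, purely illustrative.
scores = [0.2, 0.5, 0.5, 0.9]
thresholds = [0.0, 0.4, 0.8, 1.0]

curve = proportion_above(scores, thresholds)
print(list(zip(thresholds, curve)))
```

A curve that stays high as the threshold grows indicates synthetic data that is closer to the reference set, which is how the caption compares masked and unmasked training.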