Rethinking Data Synthesis: A Teacher Model Training Recipe with Interpretation

Yifang Chen; David Zhu; Simon Du; Kevin Jamieson; Yang Liu

TL;DR

This work treats data synthesis for LLMs as a distinct training task and introduces NOMAD, a paradigm built on no-prompt-masked training and careful selection of the training data to improve synthetic data quality. In small and moderate data regimes, NOMAD achieves notable gains on evaluation tasks such as TriviaQA and GSM8K, especially when it combines a no-prompt-masked synthesis model with filtered data. This indicates that traditional prompt-masked SFT can be detrimental when the model is trained for data generation. A key contribution is the analysis of synthetic data quality using the NormSim similarity metric, which frames data quality in terms of relevance and novelty. The analysis shows that median similarity to the target distribution yields the best performance and that larger synthetic datasets can harm novelty. The findings offer practical guidance for constructing synthetic data pipelines and interpreting synthetic data, with implications for data efficiency in LLM instruction tuning and for future work on data-centric training strategies.

Abstract

Recent advances in large language model (LLM) training have highlighted the need for diverse, high-quality instruction data. Many recent works explore synthetic data generation using LLMs. However, they primarily focus on prompt engineering with standard supervised instruction-finetuned models, which has a fundamental limitation: these models are optimized for general question answering and problem solving rather than for data generation. We propose a paradigm shift named NOMAD by investigating how to train models specifically for data generation, and we demonstrate that this task differs significantly from training a classical LM. We identify two key factors: no-prompt-masked training and proper training set size selection. Our method, NOMAD, shows substantial improvements over baselines, achieving gains of more than 4% on TriviaQA and more than 2% on GSM8K with limited training data. Finally, we offer new insights by interpreting synthetic data through the lenses of "relevance" and "novelty".
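To make the distinction between prompt-masked and no-prompt-masked training concrete, here is a minimal sketch of how the loss labels might differ for one instruction-response pair. It is an illustration under stated assumptions, not the authors' implementation: the helper names `build_labels` and `count_supervised` and the toy token ids are hypothetical, and the value -100 is assumed as the ignored label because it is the default `ignore_index` of PyTorch's `CrossEntropyLoss`.

```python
# Hypothetical sketch: label construction for prompt-masked vs. no-prompt-masked SFT.
# Assumption: labels equal to IGNORE_INDEX are excluded from the loss
# (-100 is the default ignore_index of PyTorch's CrossEntropyLoss).
IGNORE_INDEX = -100


def build_labels(prompt_ids, response_ids, mask_prompt):
    """Return (input_ids, labels) for one instruction-response pair."""
    input_ids = list(prompt_ids) + list(response_ids)
    if mask_prompt:
        # Prompt-masked SFT: only response tokens contribute to the loss.
        labels = [IGNORE_INDEX] * len(prompt_ids) + list(response_ids)
    else:
        # No-prompt-masked training: prompt tokens are supervised too,
        # so the model also learns to produce instructions.
        labels = list(input_ids)
    return input_ids, labels


def count_supervised(labels):
    """Count the tokens that would contribute to the loss."""
    return sum(1 for token in labels if token != IGNORE_INDEX)


# Toy token ids, chosen only for illustration.
prompt = [11, 12, 13]
response = [21, 22]

_, masked_labels = build_labels(prompt, response, mask_prompt=True)
_, unmasked_labels = build_labels(prompt, response, mask_prompt=False)

print(masked_labels)    # [-100, -100, -100, 21, 22]
print(unmasked_labels)  # [11, 12, 13, 21, 22]
```

In this sketch the only difference between the two regimes is whether prompt positions carry a real label, which is why a model trained without prompt masking receives a learning signal for generating the instruction itself.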
