A Survey of Data Synthesis Approaches
Hsin-Yu Chang, Pei-Yu Chen, Tun-Hsiang Chou, Chang-Sheng Kao, Hsuan-Yun Yu, Yen-Ting Lin, Yun-Nung Chen
TL;DR
The paper addresses the need to systematically categorize synthetic data generation and filtering across diverse ML tasks. It proposes a taxonomy of augmentation objectives (Diversity, Balancing, Domain Shift, Edge Cases) and a four-era view of techniques (Expert Knowledge, Direct Training, Pre-train then Fine-tune, Foundation Models without Fine-tuning) to organize the field. It also details a three-part post-processing framework (Basic Quality, Label Consistency, Data Distribution) and discusses future directions including quality emphasis, standardized evaluation, and multi-modal augmentation. The findings offer practical guidance for selecting augmentation strategies and highlight gaps in evaluation benchmarks and multi-modal handling. Overall, the survey provides a structured lens to advance synthetic data research and its deployment in real-world settings.
Abstract
This paper provides a detailed survey of synthetic data techniques. We first discuss the expected goals of using synthetic data in data augmentation, which can be divided into four parts: 1) Improving Diversity, 2) Data Balancing, 3) Addressing Domain Shift, and 4) Resolving Edge Cases. Because data synthesis is closely tied to the prevailing machine learning techniques of its time, we organize synthetic data techniques into four categories: 1) Expert Knowledge, 2) Direct Training, 3) Pre-train then Fine-tune, and 4) Foundation Models without Fine-tuning. Next, we categorize the goals of synthetic data filtering into three types: 1) Basic Quality, 2) Label Consistency, and 3) Data Distribution. In Section 5, we discuss future directions for synthetic data and identify three that we believe are important: 1) a stronger focus on quality, 2) the evaluation of synthetic data, and 3) multi-modal data augmentation.
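
To make the three filtering goals concrete, the following is a minimal Python sketch of how they might be chained over text classification data. It is an illustration under stated assumptions, not a method taken from the survey: the function names, the `min_chars` and `max_per_label` thresholds, the duplicate check, and the keyword-based `toy_predict` stand-in for a trained classifier are all hypothetical.

```python
from collections import Counter
from typing import Callable, Dict, List

Example = Dict[str, str]  # hypothetical record shape: {"text": ..., "label": ...}


def basic_quality(examples: List[Example], min_chars: int = 5) -> List[Example]:
    """Basic Quality (assumed rule): drop very short texts and exact duplicates."""
    seen = set()
    kept = []
    for ex in examples:
        text = ex["text"].strip()
        if len(text) < min_chars or text in seen:
            continue
        seen.add(text)
        kept.append(ex)
    return kept


def label_consistency(
    examples: List[Example], predict_label: Callable[[str], str]
) -> List[Example]:
    """Label Consistency (assumed rule): keep examples whose assigned label
    matches the prediction of a reference classifier."""
    return [ex for ex in examples if predict_label(ex["text"]) == ex["label"]]


def data_distribution(examples: List[Example], max_per_label: int) -> List[Example]:
    """Data Distribution (assumed rule): cap the number of examples per label."""
    counts: Counter = Counter()
    kept = []
    for ex in examples:
        if counts[ex["label"]] < max_per_label:
            counts[ex["label"]] += 1
            kept.append(ex)
    return kept


def filter_synthetic(
    examples: List[Example],
    predict_label: Callable[[str], str],
    max_per_label: int = 2,
) -> List[Example]:
    """Apply the three filters in sequence."""
    cleaned = basic_quality(examples)
    consistent = label_consistency(cleaned, predict_label)
    return data_distribution(consistent, max_per_label)


def toy_predict(text: str) -> str:
    """Hypothetical stand-in for a trained classifier: a keyword rule."""
    return "pos" if "good" in text.lower() else "neg"


synthetic = [
    {"text": "This movie was good.", "label": "pos"},
    {"text": "This movie was good.", "label": "pos"},      # exact duplicate
    {"text": "ok", "label": "pos"},                        # too short
    {"text": "A good and moving story.", "label": "neg"},  # label disagrees
    {"text": "Really good acting.", "label": "pos"},
    {"text": "Good soundtrack too.", "label": "pos"},      # exceeds per-label cap
    {"text": "Dull and far too long.", "label": "neg"},
]

filtered = filter_synthetic(synthetic, toy_predict, max_per_label=2)
for ex in filtered:
    print(ex["label"], "|", ex["text"])
```

In practice each stage would likely use a stronger signal than the toy rules shown here, for example a trained classifier for label agreement, but the ordering from cheap checks to distribution-level checks would stay the same.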
