A Survey of Data Synthesis Approaches

Hsin-Yu Chang, Pei-Yu Chen, Tun-Hsiang Chou, Chang-Sheng Kao, Hsuan-Yun Yu, Yen-Ting Lin, Yun-Nung Chen

TL;DR

The paper addresses the need to systematically categorize synthetic data generation and filtering across diverse ML tasks. It proposes a taxonomy of augmentation objectives (Diversity, Balancing, Domain Shift, Edge Cases) and a four-era view of techniques (Expert Knowledge, Direct Training, Pre-train then Fine-tune, Foundation Models without Fine-tuning) to organize the field. It also details a three-part post-processing framework (Basic Quality, Label Consistency, Data Distribution) and discusses future directions including quality emphasis, standardized evaluation, and multi-modal augmentation. The findings offer practical guidance for selecting augmentation strategies and highlight gaps in evaluation benchmarks and multi-modal handling. Overall, the survey provides a structured lens to advance synthetic data research and its deployment in real-world settings.

Abstract

This paper provides a detailed survey of synthetic data techniques. We first discuss the expected goals of using synthetic data in data augmentation, which can be divided into four parts: 1) Improving Diversity, 2) Data Balancing, 3) Addressing Domain Shift, and 4) Resolving Edge Cases. Data synthesis is closely tied to the prevailing machine learning techniques of its time; therefore, we group synthetic data techniques into four categories: 1) Expert-knowledge, 2) Direct Training, 3) Pre-train then Fine-tune, and 4) Foundation Models without Fine-tuning. Next, we categorize the goals of synthetic data filtering into three types for discussion: 1) Basic Quality, 2) Label Consistency, and 3) Data Distribution. In Section 5 of this paper, we also discuss the future directions of synthetic data and state three directions that we believe are important: 1) a stronger focus on quality, 2) the evaluation of synthetic data, and 3) multi-modal data augmentation.
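
As a rough illustration of the three filtering goals named in the abstract (Basic Quality, Label Consistency, Data Distribution), the sketch below chains one toy check per goal over a small list of synthetic text samples. It is not the survey's method. The `Sample` type, the function names, the thresholds, and the keyword classifier are all hypothetical stand-ins chosen to keep the example self-contained.

```python
from collections import Counter
from dataclasses import dataclass


@dataclass(frozen=True)
class Sample:
    """Hypothetical container for one synthetic example."""
    text: str
    label: str


def basic_quality(samples, min_words=3):
    """Basic Quality stand-in: drop very short texts and case-insensitive duplicates."""
    seen, kept = set(), []
    for s in samples:
        key = s.text.strip().lower()
        if len(key.split()) >= min_words and key not in seen:
            seen.add(key)
            kept.append(s)
    return kept


def label_consistency(samples, classify):
    """Label Consistency stand-in: keep samples whose label agrees with a reference classifier."""
    return [s for s in samples if classify(s.text) == s.label]


def data_distribution(samples, max_per_label):
    """Data Distribution stand-in: cap the number of samples kept per label."""
    counts, kept = Counter(), []
    for s in samples:
        if counts[s.label] < max_per_label:
            counts[s.label] += 1
            kept.append(s)
    return kept


def filter_synthetic(samples, classify, max_per_label=2):
    """Apply the three stand-in filters in sequence."""
    samples = basic_quality(samples)
    samples = label_consistency(samples, classify)
    return data_distribution(samples, max_per_label)


def keyword_classifier(text):
    """Toy reference classifier (an assumption, not a real model)."""
    return "positive" if "good" in text.lower() else "negative"


demo = [
    Sample("This movie was good fun", "positive"),
    Sample("this movie was good fun", "positive"),      # duplicate
    Sample("Bad", "negative"),                          # too short
    Sample("A really good plot overall", "negative"),   # label disagrees with classifier
    Sample("The ending felt very dull", "negative"),
    Sample("Good acting and good music", "positive"),
    Sample("Such a good soundtrack here", "positive"),  # exceeds the per-label cap
]

result = filter_synthetic(demo, keyword_classifier)
for s in result:
    print(s.label, "|", s.text)
```

In practice each stage would likely use a stronger signal, such as a trained classifier for label agreement or a comparison against the real data distribution; the ordering shown here is only one plausible choice.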

Paper Structure (21 sections, 1 figure, 2 tables)

Figures (1)

  • Figure 1: Four approaches to generating synthetic data: 1) Expert-Knowledge, 2) Direct Training, 3) Pre-train then Fine-tune, and 4) Foundation Models without Fine-tuning. Each approach is discussed in detail in Section 3.