
Reasoning-Driven Synthetic Data Generation and Evaluation

Tim R. Davidson, Benoit Seguin, Enrico Bacis, Cesar Ilharco, Hamza Harkous

Abstract

Although many AI applications of interest require specialized multi-modal models, relevant data to train such models is inherently scarce or inaccessible. Filling these gaps with human annotators is prohibitively expensive, error-prone, and time-consuming, leading model builders to increasingly consider synthetic data as a scalable alternative. However, existing synthetic data generation methods often rely on manual prompts, evolutionary algorithms, or extensive seed data from the target distribution, limiting their scalability, explainability, and control. In this paper, we introduce Simula: a novel reasoning-driven framework for data generation and evaluation. It employs a seedless, agentic approach to generate synthetic datasets at scale, allowing users to define desired dataset characteristics through an explainable and controllable process that enables fine-grained resource allocation. We show the efficacy of our approach on a variety of datasets, rigorously testing both intrinsic and downstream properties. Our work (1) offers guidelines for synthetic data mechanism design, (2) provides insights into generating and evaluating synthetic data at scale, and (3) unlocks new opportunities for developing and deploying AI in domains where data scarcity or privacy concerns are paramount.

Paper Structure

This paper contains 37 sections, 1 equation, 13 figures, 6 tables, 3 algorithms.

Figures (13)

  • Figure 1: Synthetic Coverage Examples. The outer squares of (a-b-c) represent a semantic factor of interest, e.g., "cat type." (a) Characterizes "random" sampling behavior, with no notion of a global coverage space. This often results in samples clustered around semantic modes and misses edge cases. The grid-like structures in (b-c) represent discrete semantic spaces defined by a taxonomy's leaf nodes at increasing levels of granularity. For example, the first level could represent "cat type" broken down into "domestic, big wild cats, small wild cats, and feral", whereas a square at a lower level might represent a specific cat breed like the "British shorthair." (b) Represents perfect global planning at increasing granularity; and (c) shows global planning with progressive coverage loss, e.g., missing the branch "big wild cats" entirely (bottom left) or missing specific breeds (bottom right).
  • Figure 2: Schematic of Simula Framework. Given user instructions $y$ and/or a data sample $\mathcal{S}$, we first (a) determine factors of interest $f_i$, which (b) are expanded into taxonomies $\mathcal{T}_i$. Next, (c) nodes of $\mathcal{T}_i$ are sampled to obtain mixes, and (d) turned into "meta prompts". A user-defined fraction, $c$, of meta prompts is "complexified." (e) Finally, meta prompts are used to generate data proposals by a Generator, and iteratively refined using a Critic step.
  • Figure 3: Double-Critic Rejection Sampling on MATH. (a) We establish the theoretical lift in the controlled setting, noting that rejection costs increase with task complexity. (b) Critic capabilities transport to the empirical setting but lose some effectiveness. (c) Calibrated Elo scores show model-human alignment. Stratified by complexity level, rejected samples are consistently assigned higher model complexity.
  • Figure 4: Intrinsic Diversity Metrics. We display dataset-wide (top) and nearest-neighbors (middle) embedding-based diversity, and taxonomic coverage (bottom). We note that Global diversification is crucial for increasing dataset-wide diversity, and that Local and Global diversification generally have an additive effect. We further note that while real data can be more or less diverse according to embedding-based metrics, it almost always covers less of the target domain than Simula variants on a taxonomy basis.
  • Figure 5: Complexity Elo Distribution of Synthetic and Real Data. We display density plots of the complexity Elo rankings for the various system versions on four datasets. We note that synthetic data can cover the entire complexity range of all datasets and that Local and Global components are generally additive, i.e., they account for different types of complexity.
  • ...and 8 more figures
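The pipeline in the Figure 2 caption (sample taxonomy leaves into a mix, render a meta prompt, optionally complexify a fraction $c$, then run a Generator/Critic refinement loop) can be sketched as follows. This is a minimal illustration, not the paper's actual implementation: the names `sample_mix`, `make_meta_prompt`, and `generate_dataset`, the prompt wording, and the stubbed generator/critic callables are all hypothetical.

```python
import random

def sample_mix(taxonomies, rng):
    """Step (c): pick one leaf node per factor taxonomy to form a 'mix'.

    `taxonomies` maps a factor name (e.g. "cat_type") to its leaf nodes.
    """
    return {factor: rng.choice(leaves) for factor, leaves in taxonomies.items()}

def make_meta_prompt(mix, complexify=False):
    """Step (d): turn a mix into a meta prompt; optionally 'complexify' it.

    The prompt text here is purely illustrative.
    """
    base = "Generate a sample where " + ", ".join(
        f"{factor}={value}" for factor, value in mix.items())
    return base + " requiring multi-step reasoning." if complexify else base

def generate_dataset(taxonomies, n, complexify_frac, generator, critic,
                     max_rounds=2, seed=0):
    """Step (e): Generator proposes a sample, a Critic iteratively refines it.

    `generator(prompt) -> sample` and `critic(sample) -> feedback | None`
    stand in for LLM calls; `None` feedback means the critic accepts.
    """
    rng = random.Random(seed)
    dataset = []
    for _ in range(n):
        mix = sample_mix(taxonomies, rng)
        prompt = make_meta_prompt(mix, complexify=rng.random() < complexify_frac)
        proposal = generator(prompt)
        for _ in range(max_rounds):
            feedback = critic(proposal)
            if feedback is None:  # critic accepts the proposal
                break
            proposal = generator(prompt + " | Revise to address: " + feedback)
        dataset.append({"mix": mix, "prompt": prompt, "sample": proposal})
    return dataset
```

Using the "cat type" example from the Figure 1 caption with trivial stub callables:

```python
taxonomies = {"cat_type": ["domestic", "big wild cats", "small wild cats", "feral"],
              "setting": ["indoor", "outdoor"]}
data = generate_dataset(taxonomies, n=5, complexify_frac=0.3,
                        generator=lambda p: "SAMPLE for: " + p,
                        critic=lambda s: None)
```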