SEDGE: Structural Extrapolated Data Generation

Kun Zhang, Jiaqi Sun, Yiqing Li, Ignavier Ng, Namrata Deka, Shaoan Xie

Abstract

This paper proposes a framework for Structural Extrapolated Data GEneration (SEDGE) based on suitable assumptions on the underlying data-generating process. We provide conditions under which data satisfying new specifications can be generated reliably, together with approximate identifiability of the distribution of such data under certain "conservative" assumptions. On the algorithmic side, we develop practical methods for extrapolated data generation, based on a structure-informed optimization strategy and on diffusion posterior sampling, respectively. We verify extrapolation performance on synthetic data, and also consider extrapolated image generation as a real-world scenario to illustrate the validity of the proposed framework.
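To make the posterior-sampling route concrete, below is a minimal, purely illustrative sketch of sampling under a new specification. Everything in it is a hypothetical stand-in: the prior is a standard Gaussian whose score is known in closed form, and the "specification" is a hand-picked soft linear constraint. The paper's actual methods use learned structural models and diffusion networks; this only shows the guided-Langevin flavor of combining a prior score with a constraint gradient.

```python
import numpy as np

rng = np.random.default_rng(0)

# Toy stand-in for a pretrained prior: N(0, I), whose score is -x.
def prior_score(x):
    return -x

def constraint_grad(x, target=2.0, weight=5.0):
    # Gradient of -(weight / 2) * (x0 + x1 - target)^2 with respect to x,
    # a soft version of the new specification x0 + x1 = target.
    r = x[..., 0] + x[..., 1] - target
    g = np.zeros_like(x)
    g[..., 0] = -weight * r
    g[..., 1] = -weight * r
    return g

# Posterior-style Langevin sampling: follow the prior score plus the
# constraint gradient, with injected noise (2000 independent chains).
x = rng.normal(size=(2000, 2))
step = 0.05
for _ in range(500):
    noise = rng.normal(size=x.shape)
    x = x + step * (prior_score(x) + constraint_grad(x)) + np.sqrt(2 * step) * noise

# The mean of x0 + x1 concentrates near the Gaussian-posterior value
# of about 1.82, pulled slightly below the target 2 by the prior.
print((x[:, 0] + x[:, 1]).mean())
```

The sample mean of $x_0 + x_1$ lands near the closed-form Gaussian posterior mean rather than exactly at the target, illustrating how the prior tempers the new specification.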

Paper Structure

This paper contains 52 sections, 14 theorems, 46 equations, 11 figures, 2 tables, 2 algorithms.

Key Result

Proposition 1

Suppose that Assumption [given_specifications_toy] holds and the data generating process follows Figure [setting_toy_two_specifications](a). Then, the novel data distribution $p(\mathbf{X} \,|\, Z_1 = 1, Z_2 = 1)$ is identifiable from the given data.
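The flavor of such an identifiability result can be illustrated with a toy numerical check. The structural equations, coefficients, and regime split below are invented for illustration and are not the paper's actual process: when each feature depends on only one specification variable, the never-observed regime $(Z_1, Z_2) = (1, 1)$ is recoverable by composing marginals identified from the observed regimes.

```python
import numpy as np

rng = np.random.default_rng(1)

# Hypothetical structural model (not the paper's exact process):
# X1 depends only on Z1, X2 only on Z2, with independent noises.
def sample_x(z1, z2, n):
    x1 = 2.0 * z1 + rng.normal(size=n)
    x2 = -3.0 * z2 + rng.normal(size=n)
    return np.column_stack([x1, x2])

# Observed regimes: (Z1, Z2) = (1, 0) and (0, 1); (1, 1) is never seen.
d10 = sample_x(1, 0, 5000)
d01 = sample_x(0, 1, 5000)

# Because X1 depends only on Z1 and X2 only on Z2, the novel joint
# p(X | Z1=1, Z2=1) factorizes into marginals identified separately:
x1_novel = d10[:, 0]   # marginal of X1 under Z1 = 1
x2_novel = d01[:, 1]   # marginal of X2 under Z2 = 1
novel = np.column_stack([x1_novel, x2_novel])

oracle = sample_x(1, 1, 5000)  # ground truth, used only for checking
print(np.allclose(novel.mean(0), oracle.mean(0), atol=0.1))  # → True
```

The composed samples match the oracle's first moments, which is exactly the kind of cross-regime identifiability the proposition asserts under its (stronger, formal) assumptions.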

Figures (11)

  • Figure 1: The two generating processes as initial thoughts.
  • Figure 2: The generating processes where features generate the specification and the observed data satisfy the constraint that the selection variable $S = 1$.
  • Figure 3: Synthetic experiment results with given $\mathbf{X}$ and $\mathbf{Z}$. Panels (a–b) illustrate the data split in a two-dimensional view over the $X_1$ and $X_3$ axes. The novel specifications induce a previously unseen joint distribution over $(X_1, X_3)$, ensuring that successful performance requires true extrapolation rather than interpolation. Panels (c–h) present the optimization-based generation (OPT) results for all five models. Panel (h) shows the three-dimensional view of the Oracle $\mathbf{X}$. Panels (i–p) visualize the three-dimensional generated $\mathbf{X}$ using OPT and diffusion posterior sampling (DPS) for models A, B, and C, and OPT for models D and E, with the corresponding MMD values to the Oracle $\mathbf{X}$ reported.
  • Figure 4: Comparison of image generation results under different compositional prompts. Baseline models include SANA [xie2024sana], Aligner [conceptaligner], Stable Diffusion 3.5-Large [esser2024scaling], QwenImage [wu2025qwen], Z-Image [cai2025z], and GPT5.2. Each row displays generated images for a prompt, and each column presents results from either our SEDGE or a baseline model. Despite its relatively small size of 1.6B parameters, our method achieves performance comparable to GPT5.2 and outperforms the 20B QwenImage model, highlighting the effectiveness of our analysis for extrapolation.
  • Figure 5: Identification of Specifications ($Z[0]$ and $Z[1]$ correspond to $Z_1$ and $Z_2$, respectively.)
  • ...and 6 more figures
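Figure 3 reports MMD values between generated samples and the Oracle $\mathbf{X}$. A standard (biased) estimator of squared MMD with a Gaussian kernel can be sketched as follows; the bandwidth and sample sizes here are arbitrary choices, not the paper's settings:

```python
import numpy as np

def gaussian_kernel(x, y, sigma=1.0):
    # Pairwise squared Euclidean distances between rows of x and y.
    d2 = ((x[:, None, :] - y[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2 * sigma ** 2))

def mmd2(x, y, sigma=1.0):
    """Biased estimator of squared MMD between sample sets x and y."""
    kxx = gaussian_kernel(x, x, sigma).mean()
    kyy = gaussian_kernel(y, y, sigma).mean()
    kxy = gaussian_kernel(x, y, sigma).mean()
    return kxx + kyy - 2 * kxy

rng = np.random.default_rng(0)
a = rng.normal(0.0, 1.0, size=(500, 3))
b = rng.normal(0.0, 1.0, size=(500, 3))  # same distribution -> small MMD
c = rng.normal(3.0, 1.0, size=(500, 3))  # shifted distribution -> large MMD

print(mmd2(a, b) < mmd2(a, c))  # → True
```

A smaller MMD to the Oracle samples indicates that the generated distribution is closer to the true novel-regime distribution, which is how the figure's panels compare OPT and DPS across models.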

Theorems & Definitions (21)

  • Proposition 1
  • Proposition 2
  • Proposition 3
  • Proposition 4
  • proof
  • ...and 11 more