Autoregressive Synthesis of Sparse and Semi-Structured Mixed-Type Data

Thomas Rückstieß; Robin Vujanic

TL;DR

Origami is the first architecture capable of natively modeling and generating semi-structured data end-to-end, and outperforms baselines spanning GAN, VAE, diffusion and autoregressive architectures on fidelity, utility and detection metrics across nearly all settings, while maintaining high privacy scores.

Abstract

Synthetic data generation is a critical capability for data sharing, privacy compliance, system benchmarking and test data provisioning. Existing methods assume dense, fixed-schema tabular data, yet this assumption is increasingly at odds with modern data systems, from document databases and REST APIs to data lakes, which store and exchange data in sparse, semi-structured formats like JSON. Applying existing tabular methods to such data requires flattening nested data into wide, sparse tables, which scales poorly. We present Origami, an autoregressive transformer-based architecture that tokenizes data records, including nested objects and variable-length arrays, into sequences of key, value and structural tokens. This representation natively handles sparsity, mixed types and hierarchical structure without flattening or imputation. Origami outperforms baselines spanning GAN, VAE, diffusion and autoregressive architectures on fidelity, utility and detection metrics across nearly all settings, while maintaining high privacy scores. On semi-structured datasets with up to 38% sparsity, baseline synthesizers either fail to scale or degrade substantially, while Origami maintains high-fidelity synthesis that is harder to distinguish from real data. To the best of our knowledge, Origami is the first architecture capable of natively modeling and generating semi-structured data end-to-end.
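The tokenization idea in the abstract can be illustrated with a minimal sketch. It assumes a simple depth-first walk over a JSON-like record and a hypothetical `(kind, payload)` token format; the token names (`OBJ_START`, `ARR_END`, and so on) and the example record are illustrative assumptions, not the vocabulary or data used by Origami.

```python
from typing import Any, List, Tuple

# Hypothetical token format: (kind, payload). Not the paper's actual vocabulary.
Token = Tuple[str, Any]


def tokenize(value: Any) -> List[Token]:
    """Depth-first walk that emits key, value and structural tokens."""
    if isinstance(value, dict):
        tokens: List[Token] = [("STRUCT", "OBJ_START")]
        for key, child in value.items():
            tokens.append(("KEY", key))      # one token per present key
            tokens.extend(tokenize(child))   # recurse into the nested value
        tokens.append(("STRUCT", "OBJ_END"))
        return tokens
    if isinstance(value, list):
        tokens = [("STRUCT", "ARR_START")]
        for child in value:                  # arrays may have any length
            tokens.extend(tokenize(child))
        tokens.append(("STRUCT", "ARR_END"))
        return tokens
    return [("VALUE", value)]                # scalar leaf of any type


# Made-up record, for illustration only.
record = {"title": "Example", "genres": ["Drama", "Crime"], "awards": {"wins": 3}}
sequence = tokenize(record)
```

In this sketch an absent key simply contributes no tokens, so sparsity shortens the sequence instead of producing missing cells that would need imputation.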

Paper Structure (57 sections, 14 equations, 6 figures, 7 tables)

This paper contains 57 sections, 14 equations, 6 figures, 7 tables.

Introduction
Related Work
GAN and VAE-based Synthesis
Diffusion Synthesis
Autoregressive Synthesis
Synthesis beyond Single Tables
Sparsity and Missing Data
Architecture
Preprocessing
Tokenization
Input Representation
Key-Value Position Encoding (KVPE)
Numeric embedding
Transformer Backbone
Left-padding
...and 42 more sections

Figures (6)

Figure 1: Tokenization of an example movies record.
Figure 2: Origami dual-head model architecture with grammar and schema constraints imposed on the discrete head.
Figure 3: Flattening and type separation of two movie records. Nested objects and arrays are mapped to dot-separated columns; variable-length arrays and absent keys produce NaN. Mixed-type columns (awards.wins: integer vs. string) and partially-present columns are expanded into a type indicator (.dtype) and per-type value columns. Homogeneous, fully-present columns (title, genres.0, genres.1) pass through unchanged.
Figure 4: KDE visualizations of sparse numeric columns on the Electric Vehicles dataset.
Figure 5: KVPE vs. sequential position encoding on a synthetic nested JSON dataset. KVPE accurately recovers path-specific marginals while sequential PE collapses to uniform ($\approx 0.5$). Results over 3 seeds.
...and 1 more figures
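The flattening step described in the Figure 3 caption can be sketched as follows. This is a minimal illustration assuming dot-separated paths and array indices as path segments; the helper names `flatten` and `to_table` and the two records are hypothetical, and the type-separation step (the `.dtype` indicator columns) is not reproduced here.

```python
import math
from typing import Any, Dict, List, Tuple


def flatten(value: Any, prefix: str = "") -> Dict[str, Any]:
    """Map nested objects and arrays to dot-separated column names."""
    if isinstance(value, dict):
        items = list(value.items())
    elif isinstance(value, list):
        items = [(str(i), v) for i, v in enumerate(value)]  # array index as key
    else:
        return {prefix: value}                               # scalar leaf
    flat: Dict[str, Any] = {}
    for key, child in items:
        path = f"{prefix}.{key}" if prefix else key
        flat.update(flatten(child, path))
    return flat


def to_table(records: List[dict]) -> Tuple[List[str], List[List[Any]]]:
    """Build a dense table over the union of columns, filling gaps with NaN."""
    rows = [flatten(r) for r in records]
    columns = sorted({c for row in rows for c in row})
    return columns, [[row.get(c, math.nan) for c in columns] for row in rows]


# Made-up records, for illustration only.
records = [
    {"title": "A", "genres": ["Drama", "Crime"], "awards": {"wins": 3}},
    {"title": "B", "genres": ["Comedy"], "awards": {"wins": "unknown"}},
]
columns, table = to_table(records)
```

Here the shorter `genres` array in the second record leaves `genres.1` as NaN, and `awards.wins` ends up holding both an integer and a string, which is the mixed-type situation the caption's type-separation step is meant to address.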
