ORIGAMI: A generative transformer architecture for predictions from semi-structured data

Thomas Rückstieß, Alana Huang, Robin Vujanic

TL;DR

ORIGAMI learns directly from semi-structured data such as JSON, without flattening it first. It treats JSON as token sequences using a structure-preserving tokenizer, a key/value position encoding (KVPE) that is invariant to key order, and a pushdown-automaton-based guardrail that enforces the grammar during training and inference. The model is a decoder-only Transformer trained to predict the next token, which models the joint distribution through the autoregressive factorization $p({\bm{x}}) = \prod_{i=1}^n p(x_i \mid x_{<i})$ and enables end-to-end multi-label and single-label classification. Across JSONified tabular datasets, DDXPlus, and CodeNet Java250, ORIGAMI achieves results competitive with or superior to baselines, including classical methods and some large pretrained models. The results show that a structure-aware inductive bias combined with grammar-constrained decoding enables effective learning on semi-structured data, and they suggest directions for unsupervised learning and programming language tasks.
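To make the factorization concrete, the following minimal sketch sums conditional log-probabilities over a token sequence. The bigram table `BIGRAM` and the helper names are hypothetical stand-ins for the Transformer's next-token distribution and are not taken from the paper.

```python
import math

def sequence_log_prob(tokens, cond_prob):
    """Chain rule: log p(x) = sum_i log p(x_i | x_<i)."""
    total = 0.0
    for i, tok in enumerate(tokens):
        total += math.log(cond_prob(tok, tokens[:i]))
    return total

# Hypothetical toy conditional that only looks at the previous token.
# A real model would condition on the whole prefix.
BIGRAM = {
    (None, "a"): 0.5, (None, "b"): 0.5,
    ("a", "a"): 0.25, ("a", "b"): 0.75,
    ("b", "a"): 0.5, ("b", "b"): 0.5,
}

def toy_cond_prob(tok, prefix):
    prev = prefix[-1] if prefix else None
    return BIGRAM[(prev, tok)]

# p(a, b, a) = p(a) * p(b | a) * p(a | b) = 0.5 * 0.75 * 0.5
joint = math.exp(sequence_log_prob(["a", "b", "a"], toy_cond_prob))
```

Under this view, classification amounts to reading off the conditional distribution at the position of the label token.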

Abstract

Despite the popularity and widespread use of semi-structured data formats such as JSON, end-to-end supervised learning applied directly to such data remains underexplored. We present ORIGAMI (Object RepresentatIon via Generative Autoregressive ModellIng), a transformer-based architecture that directly processes nested key/value pairs while preserving their hierarchical semantics. Our key technical contributions include: (1) a structure-preserving tokenizer, (2) a novel key/value position encoding scheme, and (3) a grammar-constrained training and inference framework that ensures valid outputs and accelerates training convergence. These enhancements enable efficient end-to-end modeling of semi-structured data. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks without architectural modifications. Empirical evaluation across diverse domains demonstrates ORIGAMI's effectiveness: On standard tabular benchmarks converted to JSON, ORIGAMI remains competitive with classical and state-of-the-art approaches. On native JSON datasets, we outperform baselines on multi-label classification and specialized models such as convolutional and graph neural networks on a code classification task. Through extensive ablation studies, we validate the impact of each architectural component and establish ORIGAMI as a robust framework for end-to-end learning on semi-structured data.
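As a rough illustration of contributions (1) and (3), the sketch below flattens a nested document into a token sequence that keeps structural markers, then uses a stack to compute which token kinds are grammatically allowed next. The token kinds (`OBJ_START`, `KEY`, `VALUE`, and so on), the stack symbols, and the example document are assumptions made for this sketch; the paper's actual vocabulary and automaton may differ.

```python
def tokenize(value, out=None):
    """Depth-first walk emitting (kind, payload) tokens; nesting is kept explicit."""
    if out is None:
        out = []
    if isinstance(value, dict):
        out.append(("OBJ_START", None))
        for key, child in value.items():
            out.append(("KEY", key))
            tokenize(child, out)
        out.append(("OBJ_END", None))
    elif isinstance(value, list):
        out.append(("ARR_START", None))
        for child in value:
            tokenize(child, out)
        out.append(("ARR_END", None))
    else:
        out.append(("VALUE", value))
    return out

VALUE_STARTS = {"VALUE", "OBJ_START", "ARR_START"}

def allowed(stack):
    """Token kinds permitted next; a decoder could mask all other logits."""
    if not stack:
        return set()                      # document already complete
    top = stack[-1]
    if top == "OBJ_KEY":
        return {"KEY", "OBJ_END"}
    if top == "ARR":
        return VALUE_STARTS | {"ARR_END"}
    return set(VALUE_STARTS)              # "TOP" or "OBJ_VAL"

def value_completed(stack):
    """Update the enclosing context after a full value has been read."""
    if stack and stack[-1] == "TOP":
        stack.pop()
    elif stack and stack[-1] == "OBJ_VAL":
        stack[-1] = "OBJ_KEY"

def is_valid(tokens):
    stack = ["TOP"]
    for kind, _ in tokens:
        if kind not in allowed(stack):
            return False
        if kind == "KEY":
            stack[-1] = "OBJ_VAL"
        elif kind == "OBJ_START":
            stack.append("OBJ_KEY")
        elif kind == "ARR_START":
            stack.append("ARR")
        else:
            if kind in ("OBJ_END", "ARR_END"):
                stack.pop()
            value_completed(stack)
    return not stack

# Hypothetical example document
doc = {"title": "Heat", "genres": ["crime", "drama"], "cast": {"lead": "Pacino"}}
tokens = tokenize(doc)
```

Calling `is_valid(tokens)` accepts the full sequence, while a truncated prefix such as `tokens[:-1]` is rejected because the stack is not empty at the end.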

Paper Structure

This paper contains 30 sections, 2 equations, 8 figures, 10 tables.

Figures (8)

  • Figure 1: Overall architecture of ORIGAMI, consisting of preprocessing of documents into integer sequences (a) and model architecture for training and inference (b).
  • Figure 2: PDA transition diagram, where edges are labeled with $\sigma, \gamma_\text{pop} \rightarrow \gamma_\text{push}$. These transitions read input token $\sigma \in \Sigma$ and manipulate the stack by popping symbol $\gamma_\text{pop}$ and pushing symbol $\gamma_\text{push}$.
  • Figure 3: Evolution of stack states when parsing the token sequence of the example movies JSON instance from Fig. 1.
  • Figure 4: Varying levels of data upscaling on the contraceptive dataset. Low upscaling factors lead to overfitting on the training data, with a typical increase in test loss after an initial drop. With upscaling factors beyond 5x, this effect is mitigated and test accuracy generally improves.
  • Figure 5: The Dungeons synthetic dataset. The corridor array contains between 4 and 8 objects. Each object has a door_no key, 3 color-coded keys (red_key, green_key, blue_key), and between 0 and 2 randomly selected monsters. The monsters, if present, add further randomness to the positions of the informative tokens but otherwise have no effect on the target. Two top-level keys, door and key_color, provide clues as to which of the corridor objects contains the correct answer for the target key treasure. To find the correct label, one has to locate the corridor object with the correct door_no and, within that object, retrieve the correct treasure based on key_color. There are 5 different treasure types, assigned uniformly at random, so a random guess is correct 20% of the time.
  • ...and 3 more figures
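
The Dungeons dataset described in Figure 5 can be approximated with a small generator. This is a sketch based only on the caption: the treasure and monster names, the `monsters` key, the uniqueness of door numbers, and the key order are all assumptions rather than details from the paper.

```python
import random

TREASURES = ["gold", "gems", "crown", "sword", "scroll"]  # hypothetical names; caption only fixes the count (5)
MONSTERS = ["troll", "wolf", "goblin"]                    # hypothetical names
COLORS = ["red", "green", "blue"]

def make_dungeon(rng):
    """Generate one instance: clues (door, key_color), corridor, and target (treasure)."""
    n = rng.randint(4, 8)                         # 4 to 8 corridor objects
    door_numbers = rng.sample(range(n), n)        # assumption: unique, shuffled door numbers
    corridor = []
    for door_no in door_numbers:
        room = {"door_no": door_no}
        monsters = rng.choices(MONSTERS, k=rng.randint(0, 2))
        if monsters:                              # assumption: key omitted when no monsters
            room["monsters"] = monsters
        for color in COLORS:
            room[f"{color}_key"] = rng.choice(TREASURES)
        corridor.append(room)
    door = rng.choice(door_numbers)
    key_color = rng.choice(COLORS)
    target_room = next(r for r in corridor if r["door_no"] == door)
    return {
        "door": door,
        "key_color": key_color,
        "corridor": corridor,
        "treasure": target_room[f"{key_color}_key"],
    }

def solve(instance):
    """Two-step lookup a model has to learn: find the room, then the colored key."""
    for room in instance["corridor"]:
        if room["door_no"] == instance["door"]:
            return room[f"{instance['key_color']}_key"]
    return None

example = make_dungeon(random.Random(0))
```

Because monsters shift where the informative keys appear in the token sequence, a model cannot rely on fixed positions and has to follow the two-step lookup that `solve` makes explicit.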