ORIGAMI: A generative transformer architecture for predictions from semi-structured data
Thomas Rückstieß, Alana Huang, Robin Vujanic
TL;DR
ORIGAMI addresses learning directly from semi-structured data such as JSON without flattening it first. It treats a JSON document as a token sequence, using a structure-preserving tokenizer, a key/value position encoding (KVPE) that is invariant to key order, and a guardrail based on a pushdown automaton that enforces the grammar during training and inference. The model is a decoder-only Transformer trained to predict the next token, which models the joint distribution through the autoregressive factorization $p({\bm{x}}) = \prod_{i=1}^n p(x_i \mid x_{<i})$ and enables end-to-end single-label and multi-label classification. Across JSONified tabular datasets, DDXPlus, and CodeNet Java250, ORIGAMI achieves results that are competitive with or superior to baselines, including classical methods and some large pretrained models. The results show that a structure-aware inductive bias combined with grammar-constrained decoding can unlock effective learning on semi-structured data, and they suggest directions for unsupervised learning and programming language tasks.
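To make the idea of structure-preserving tokenization with order-invariant positions more concrete, here is a minimal sketch. It is not the authors' tokenizer: the function name `tokenize`, the token spellings (`OBJ_START`, `KEY:...`, `VAL:...`, and so on), and the use of a key path as the "position" of a token are all assumptions made for illustration. The point it shows is that when each token is tagged with its key path rather than its sequence index, reordering the keys of an object changes the sequence but not the multiset of (token, position) pairs.

```python
import json

def tokenize(value, path=()):
    """Yield (token, key_path) pairs for a JSON-like Python value.

    Hypothetical scheme: nesting is kept explicit with start/end tokens,
    and each token carries the key path it belongs to instead of an
    integer sequence index.
    """
    if isinstance(value, dict):
        yield ("OBJ_START", path)
        for key, child in value.items():
            child_path = path + (key,)
            yield (f"KEY:{key}", child_path)
            yield from tokenize(child, child_path)
        yield ("OBJ_END", path)
    elif isinstance(value, list):
        yield ("ARR_START", path)
        for child in value:
            # Array elements share the path of the array itself in this sketch.
            yield from tokenize(child, path)
        yield ("ARR_END", path)
    else:
        # Primitive values become a single value token.
        yield (f"VAL:{json.dumps(value)}", path)

doc = {"a": 1, "b": {"c": [2, 3]}}
reordered = {"b": {"c": [2, 3]}, "a": 1}

tokens = list(tokenize(doc))
reordered_tokens = list(tokenize(reordered))

# The sequences differ, but the (token, key_path) pairs are the same multiset.
same_pairs = sorted(tokens) == sorted(reordered_tokens)
```

A position encoding computed from `key_path` alone would therefore assign the same position to a given key/value pair regardless of where it appears in the serialized object, which is the property the TL;DR attributes to KVPE.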
Abstract
Despite the popularity and widespread use of semi-structured data formats such as JSON, end-to-end supervised learning applied directly to such data remains underexplored. We present ORIGAMI (Object RepresentatIon via Generative Autoregressive ModellIng), a transformer-based architecture that directly processes nested key/value pairs while preserving their hierarchical semantics. Our key technical contributions include: (1) a structure-preserving tokenizer, (2) a novel key/value position encoding scheme, and (3) a grammar-constrained training and inference framework that ensures valid outputs and accelerates training convergence. These enhancements enable efficient end-to-end modeling of semi-structured data. By reformulating classification as next-token prediction, ORIGAMI naturally handles both single-label and multi-label tasks without architectural modifications. Empirical evaluation across diverse domains demonstrates ORIGAMI's effectiveness: on standard tabular benchmarks converted to JSON, ORIGAMI remains competitive with classical and state-of-the-art approaches. On native JSON datasets, it outperforms baselines on multi-label classification, and it outperforms specialized models such as convolutional and graph neural networks on a code classification task. Through extensive ablation studies, we validate the impact of each architectural component and establish ORIGAMI as a robust framework for end-to-end learning on semi-structured data.
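The combination of grammar constraints and classification as next-token prediction can be sketched as follows. This is an illustrative toy under stated assumptions, not the paper's implementation: the helpers `allowed_kinds`, `kind_of`, and `constrained_softmax`, the token spellings, and the example logits are all hypothetical. A small stack-based rule (standing in for the pushdown automaton) decides which kinds of token are grammatical next, and every other token is masked out before the softmax, so the predicted label is always a valid value token.

```python
import math

def allowed_kinds(stack, last_kind):
    """Toy pushdown rule: which token kinds may follow, given open containers."""
    if last_kind == "KEY":
        # A key must be followed by a value or the start of a nested container.
        return {"VAL", "OBJ_START", "ARR_START"}
    top = stack[-1] if stack else None
    if top == "OBJ":
        return {"KEY", "OBJ_END"}
    if top == "ARR":
        return {"VAL", "OBJ_START", "ARR_START", "ARR_END"}
    # Nothing open: either the document has not started or it is complete.
    return {"OBJ_START"} if last_kind is None else {"END"}

def kind_of(token):
    """Map 'KEY:label' -> 'KEY', 'VAL:dog' -> 'VAL', 'OBJ_END' -> 'OBJ_END'."""
    return token.split(":", 1)[0]

def constrained_softmax(logits, allowed):
    """Softmax over logits where tokens outside `allowed` get probability 0."""
    if not any(tok in allowed for tok in logits):
        raise ValueError("no grammatical token available")
    masked = {t: (s if t in allowed else -math.inf) for t, s in logits.items()}
    peak = max(masked.values())
    exps = {t: math.exp(s - peak) for t, s in masked.items()}
    total = sum(exps.values())
    return {t: e / total for t, e in exps.items()}

# Hypothetical state: inside an object, the target key "label" was just emitted.
logits = {"VAL:cat": 1.0, "VAL:dog": 2.0, "KEY:label": 5.0, "OBJ_END": 0.5}
kinds = allowed_kinds(stack=["OBJ"], last_kind="KEY")
allowed = {tok for tok in logits if kind_of(tok) in kinds}

probs = constrained_softmax(logits, allowed)
prediction = max(probs, key=probs.get)
```

Although `KEY:label` has the highest raw logit here, it is ungrammatical directly after a key, so the mask removes it and the prediction falls on a value token. In the same spirit, a multi-label target could be written as an array value and decoded token by token until the array is closed, which is one way to read the abstract's claim that no architectural change is needed between single-label and multi-label tasks.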
