Autoregressive Models for Knowledge Graph Generation
Thiviyan Thanapalasingam, Antonis Vozikis, Peter Bloem, Paul Groth
TL;DR
The paper tackles knowledge graph generation by learning a joint distribution $p_\theta(G)$ over subgraphs, which lets the model capture semantic constraints without explicit rule supervision. It introduces ARK, an autoregressive approach that linearizes graphs as sequences of triples and generates each graph token by token, and SAIL, a variational extension that enables controlled generation and interpolation in latent space. On the IntelliGraphs benchmark, ARK and SAIL achieve high semantic validity (89.2% to 100.0% across datasets), strong novelty, and efficient compression, outperforming independent-triple KGE baselines. The work shows that model capacity, especially hidden dimensionality $d_{\text{model}} \ge 64$, matters more than depth, and that GRU-based decoders offer favorable efficiency for KG generation, with practical implications for knowledge base augmentation and query answering.
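As a brief sketch of what token-by-token generation implies: if the linearized graph is written as a token sequence $x_1, \dots, x_T$ (this notation is an assumption for illustration, not taken from the paper), then for a fixed linearization order an autoregressive model factorizes the joint distribution with the chain rule:

$$p_\theta(G) = \prod_{t=1}^{T} p_\theta(x_t \mid x_1, \dots, x_{t-1})$$

Each factor conditions on every previously generated token, which is what allows dependencies between triples to be modeled, in contrast to scoring each triple independently.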
Abstract
Knowledge Graph (KG) generation requires models to learn complex semantic dependencies between triples while maintaining domain validity constraints. Unlike link prediction, which scores triples independently, generative models must capture interdependencies across entire subgraphs to produce semantically coherent structures. We present ARK (Auto-Regressive Knowledge Graph Generation), a family of autoregressive models that generate KGs by treating graphs as sequences of (head, relation, tail) triples. ARK learns implicit semantic constraints directly from data, including type consistency, temporal validity, and relational patterns, without explicit rule supervision. On the IntelliGraphs benchmark, our models achieve 89.2% to 100.0% semantic validity across diverse datasets while generating novel graphs not seen during training. We also introduce SAIL, a variational extension of ARK that enables controlled generation through learned latent representations, supporting both unconditional sampling and conditional completion from partial graphs. Our analysis reveals that model capacity (hidden dimensionality >= 64) is more critical than architectural depth for KG generation, with recurrent architectures achieving comparable validity to transformer-based alternatives while offering substantial computational efficiency. These results demonstrate that autoregressive models provide an effective framework for KG generation, with practical applications in knowledge base completion and query answering.
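To make the idea of treating a graph as a sequence of (head, relation, tail) triples concrete, here is a minimal sketch of one possible linearization. The special tokens `<bos>` and `<eos>`, the sorted triple order, and the function names `linearize` and `delinearize` are assumptions made for illustration; the abstract does not specify the tokenization or ordering that ARK actually uses.

```python
from typing import List, Tuple

Triple = Tuple[str, str, str]

# Hypothetical special tokens marking the start and end of a graph.
BOS, EOS = "<bos>", "<eos>"


def linearize(graph: List[Triple]) -> List[str]:
    """Flatten a list of (head, relation, tail) triples into one token sequence."""
    tokens = [BOS]
    # Sorting gives a canonical order; this choice is an assumption of the sketch.
    for head, relation, tail in sorted(graph):
        tokens.extend([head, relation, tail])
    tokens.append(EOS)
    return tokens


def delinearize(tokens: List[str]) -> List[Triple]:
    """Recover triples from a token sequence produced by linearize."""
    body = [t for t in tokens if t not in (BOS, EOS)]
    if len(body) % 3 != 0:
        raise ValueError("token sequence does not split into complete triples")
    return [(body[i], body[i + 1], body[i + 2]) for i in range(0, len(body), 3)]


example_graph = [
    ("paris", "capital_of", "france"),
    ("france", "part_of", "europe"),
]
example_tokens = linearize(example_graph)
```

An autoregressive model trained on such sequences predicts each token from the ones before it, so a generated sequence can be mapped back to a graph with `delinearize` and then checked against the domain's validity constraints.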
