GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables
Mathieu Huot, Matin Ghavami, Alexander K. Lew, Ulrich Schaechtle, Cameron E. Freer, Zane Shelby, Martin C. Rinard, Feras A. Saad, Vikash K. Mansinghka
TL;DR
GenSQL is a declarative extension of SQL that unifies access to probabilistic models of tabular data with standard database queries, enabling concise Bayesian workflows over databases. It formalizes a unified Abstract Model Interface (AMI) for models, provides a measure-theoretic denotational semantics, and proves soundness guarantees for both exact and approximate backends. GenSQL queries are normalized and then lowered to a target language that interacts with AMI-backed rowModels, with proven guarantees of correctness and of consistency under approximate inference. Empirical evaluation on runtime benchmarks and two real-world case studies demonstrates speedups of $1.7$–$6.8$x over the closest competitor and practical utility for anomaly detection and conditional synthetic data generation. The work emphasizes multi-language model integration, declarative querying, and reusable optimizations, positioning GenSQL as a practical bridge between databases and probabilistic modeling.
Abstract
This article presents GenSQL, a probabilistic programming system for querying probabilistic generative models of database tables. By augmenting SQL with only a few key primitives for querying probabilistic models, GenSQL enables complex Bayesian inference workflows to be concisely implemented. GenSQL's query planner rests on a unified programmatic interface for interacting with probabilistic models of tabular data, which makes it possible to use models written in a variety of probabilistic programming languages that are tailored to specific workflows. Probabilistic models may be automatically learned via probabilistic program synthesis, hand-designed, or a combination of both. GenSQL is formalized using a novel type system and denotational semantics, which together enable us to establish proofs that precisely characterize its soundness guarantees. We evaluate our system on two real-world case studies, anomaly detection in clinical trials and conditional synthetic data generation for a virtual wet lab, and show that GenSQL captures the complexity of the data more accurately than common baselines. We also show that the declarative syntax in GenSQL is more concise and less error-prone than several alternatives. Finally, GenSQL delivers a 1.7-6.8x speedup compared to its closest competitor on a representative benchmark set and runs in comparable time to hand-written code, in part due to its reusable optimizations and code specialization.
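To make the idea of a unified programmatic interface to tabular models concrete, the following is a minimal Python sketch, not GenSQL's actual API. It assumes a hypothetical interface with two operations, `logpdf` and `simulate`, and a toy exact backend (`TableModel`) that stores an explicit joint distribution over discrete columns. All class, method, and column names, and the example probabilities, are illustrative assumptions.

```python
import math
import random
from typing import Dict, Mapping, Protocol, Sequence, Tuple

Row = Mapping[str, str]


class RowModel(Protocol):
    """Hypothetical model interface; the names are assumptions, not GenSQL's."""

    def logpdf(self, targets: Row, constraints: Row) -> float: ...

    def simulate(self, targets: Sequence[str], constraints: Row,
                 rng: random.Random) -> Dict[str, str]: ...


class TableModel:
    """Toy exact backend: an explicit joint pmf over discrete columns."""

    def __init__(self, columns: Sequence[str],
                 pmf: Mapping[Tuple[str, ...], float]) -> None:
        self.columns = tuple(columns)
        self.pmf = dict(pmf)

    def _matching(self, assignment: Row):
        # Yield (row, probability) pairs that agree with the assignment.
        for values, p in self.pmf.items():
            row = dict(zip(self.columns, values))
            if all(row[c] == v for c, v in assignment.items()):
                yield row, p

    def _mass(self, assignment: Row) -> float:
        return sum(p for _, p in self._matching(assignment))

    def logpdf(self, targets: Row, constraints: Row) -> float:
        # log P(targets | constraints); targets and constraints are
        # assumed to mention disjoint columns.
        evidence = self._mass(constraints)
        if evidence <= 0.0:
            raise ValueError("constraints have zero probability")
        joint = self._mass({**constraints, **targets})
        return math.log(joint / evidence) if joint > 0.0 else float("-inf")

    def simulate(self, targets: Sequence[str], constraints: Row,
                 rng: random.Random) -> Dict[str, str]:
        # Draw the target columns from the conditional distribution.
        matches = list(self._matching(constraints))
        if sum(p for _, p in matches) <= 0.0:
            raise ValueError("constraints have zero probability")
        rows = [r for r, _ in matches]
        weights = [p for _, p in matches]
        row = rng.choices(rows, weights=weights, k=1)[0]
        return {c: row[c] for c in targets}


def flag_unlikely(model: RowModel, rows: Sequence[Row], column: str,
                  threshold: float) -> list:
    """Return rows whose `column` value is improbable given the other columns."""
    flagged = []
    for row in rows:
        constraints = {c: v for c, v in row.items() if c != column}
        p = math.exp(model.logpdf({column: row[column]}, constraints))
        if p < threshold:
            flagged.append(row)
    return flagged


# Illustrative joint distribution over two made-up columns.
model = TableModel(
    ["arm", "outcome"],
    {
        ("drug", "improved"): 0.4,
        ("drug", "worse"): 0.1,
        ("placebo", "improved"): 0.2,
        ("placebo", "worse"): 0.3,
    },
)
records = [
    {"arm": "drug", "outcome": "worse"},
    {"arm": "placebo", "outcome": "worse"},
]
print(flag_unlikely(model, records, "outcome", threshold=0.25))
print(model.simulate(["outcome"], {"arm": "drug"}, random.Random(0)))
```

Because `flag_unlikely` depends only on the interface, any backend that implements the same two methods, exact or approximate, could be substituted without changing the query logic. This is the kind of decoupling between queries and models that the abstract describes.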
