
GenSQL: A Probabilistic Programming System for Querying Generative Models of Database Tables

Mathieu Huot, Matin Ghavami, Alexander K. Lew, Ulrich Schaechtle, Cameron E. Freer, Zane Shelby, Martin C. Rinard, Feras A. Saad, Vikash K. Mansinghka

TL;DR

GenSQL introduces a declarative extension of SQL that unifies access to probabilistic models of tabular data with standard database queries, enabling concise Bayesian workflows over databases. It formalizes a unified Abstract Model Interface (AMI) for models, provides a measure-theoretic denotational semantics, and proves soundness guarantees for both exact and approximate backends. The system lowers GenSQL queries to a target language that interacts with AMI-backed rowModels, offering normalization, a lowering transform, and proven guarantees of correctness and consistency under approximate inference. Empirical evaluation across runtime benchmarks and two real-world case studies demonstrates competitive performance (speedups of $1.7$–$6.8$x) and practical utility for anomaly detection and conditional synthetic data generation. The work emphasizes multi-language model integration, declarative querying, and reusable optimizations, positioning GenSQL as a practical bridge between databases and probabilistic modeling.

Abstract

This article presents GenSQL, a probabilistic programming system for querying probabilistic generative models of database tables. By augmenting SQL with only a few key primitives for querying probabilistic models, GenSQL enables complex Bayesian inference workflows to be concisely implemented. GenSQL's query planner rests on a unified programmatic interface for interacting with probabilistic models of tabular data, which makes it possible to use models written in a variety of probabilistic programming languages that are tailored to specific workflows. Probabilistic models may be automatically learned via probabilistic program synthesis, hand-designed, or a combination of both. GenSQL is formalized using a novel type system and denotational semantics, which together enable us to establish proofs that precisely characterize its soundness guarantees. We evaluate our system on two real-world case studies -- anomaly detection in clinical trials and conditional synthetic data generation for a virtual wet lab -- and show that GenSQL more accurately captures the complexity of the data than common baselines. We also show that the declarative syntax in GenSQL is more concise and less error-prone than several alternatives. Finally, GenSQL delivers a 1.7-6.8x speedup compared to its closest competitor on a representative benchmark set and runs in comparable time to hand-written code, in part due to its reusable optimizations and code specialization.
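
To make the idea of a "unified programmatic interface" for tabular models more concrete, here is a minimal Python sketch. It is an illustration only and not GenSQL's actual interface: the class names `RowModel` and `TableModel`, the method names `simulate`, `logpdf`, and `condition`, and the toy two-column table are all assumptions introduced for this example. It shows how one small set of operations (sampling, density evaluation, conditioning) could be offered by any model backend, so that a query layer need not know how the model was written.

```python
import math
import random
from abc import ABC, abstractmethod
from typing import Dict, Sequence, Tuple

Row = Dict[str, str]


class RowModel(ABC):
    """Hypothetical model interface; the names are illustrative, not GenSQL's API."""

    @abstractmethod
    def simulate(self, rng: random.Random) -> Row:
        """Draw one complete row from the model."""

    @abstractmethod
    def logpdf(self, row: Row) -> float:
        """Log-probability of a (possibly partial) assignment of columns."""

    @abstractmethod
    def condition(self, evidence: Row) -> "RowModel":
        """Return a new model conditioned on the given column values."""


class TableModel(RowModel):
    """Toy backend: an explicit joint probability table over discrete columns."""

    def __init__(self, columns: Sequence[str], probs: Dict[Tuple[str, ...], float]):
        self.columns = tuple(columns)
        total = sum(probs.values())
        # Renormalize so that conditioning can reuse this constructor.
        self.probs = {k: v / total for k, v in probs.items() if v > 0}

    def _matches(self, values: Tuple[str, ...], partial: Row) -> bool:
        full = dict(zip(self.columns, values))
        return all(full[col] == val for col, val in partial.items())

    def simulate(self, rng: random.Random) -> Row:
        keys = list(self.probs)
        weights = [self.probs[k] for k in keys]
        values = rng.choices(keys, weights=weights, k=1)[0]
        return dict(zip(self.columns, values))

    def logpdf(self, row: Row) -> float:
        # Marginalize over every column that `row` leaves unspecified.
        p = sum(q for values, q in self.probs.items() if self._matches(values, row))
        return math.log(p) if p > 0 else float("-inf")

    def condition(self, evidence: Row) -> "TableModel":
        kept = {k: q for k, q in self.probs.items() if self._matches(k, evidence)}
        if not kept:
            raise ValueError("evidence has probability zero under the model")
        return TableModel(self.columns, kept)


# Made-up joint distribution over two columns, for illustration only.
model = TableModel(
    ("smoker", "cough"),
    {("yes", "yes"): 0.2, ("yes", "no"): 0.1, ("no", "yes"): 0.1, ("no", "no"): 0.6},
)
posterior = model.condition({"cough": "yes"})
p_smoker_given_cough = math.exp(posterior.logpdf({"smoker": "yes"}))
sample = posterior.simulate(random.Random(0))
```

In this sketch, a query such as "probability of `smoker = yes` given `cough = yes`" only touches `condition` and `logpdf`, so a different backend (for example, one that approximates these quantities) could be swapped in without changing the calling code.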

Paper Structure (80 sections, 8 theorems, 56 equations, 29 figures, 1 table)

Key Result

theorem 1

Let $\Gamma, [] \vdash t: T[?\textsc{id}]\{\textsc{cols}\}$ be a safe query and suppose the AMI methods have asymptotically sound implementations. Then, for every evaluation of the context $\gamma$, $\mathbb{P}$-almost surely

Figures (29)

  • Figure 1: Overview of GenSQL.
  • Figure 2: Estimating the conditional mutual information between age and bmi given patient weights.
  • Figure 3: Syntax of GenSQL.
  • Figure 4: Type system of GenSQL.
  • Figure 5: Denotational semantics of GenSQL.
  • ...and 24 more figures

Theorems & Definitions (9)

  • theorem 1: Consistent AMI Guarantee
  • proposition 1: Correctness of caching (exact computations)
  • proposition 2: Correctness of caching (approximate computations)
  • proposition 3: Correctness of independence simplification
  • proposition 4
  • definition 1: Asymptotically Sound Approximate AMI Implementation
  • lemma 1
  • lemma 2: Fundamental lemma of logical relations
  • theorem 2: Consistent AMI Guarantee