Table of Contents
Fetching ...

VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints

Yang He, Pinhan Zhao, Xinyu Wang, Yuepeng Wang

TL;DR

VeriEQL tackles the practical problem of bounded SQL query equivalence under integrity constraints by introducing an SMT-based encoding that models query semantics with symbolic tuples over integers and uninterpreted functions. It formalizes a larger, real-world fragment of SQL (including ORDER BY, CASE, WITH, and complex constraints) and proves correctness with respect to a three-valued, NULL-aware semantics. The approach builds a symbolic database under a bound, encodes integrity constraints and query operators into a single SMT formula, and uses SAT/SMT solving to either certify bounded equivalence or produce counterexamples. Empirically, VeriEQL dramatically outperforms state-of-the-art baselines across 24,455 benchmark pairs, proving or disproving significantly more cases and revealing real bugs in systems like MySQL and Calcite. The work demonstrates the practical impact of formal equivalence checking for real-world SQL, including automated debugging, test generation, and validation of rewrite rules.

Abstract

The task of SQL query equivalence checking is important in various real-world applications (including query rewriting and automated grading) that involve complex queries with integrity constraints; yet, state-of-the-art techniques are very limited in their capability of reasoning about complex features (e.g., those that involve sorting, case statement, rich integrity constraints, etc.) in real-life queries. To the best of our knowledge, we propose the first SMT-based approach and its implementation, VeriEQL, capable of proving and disproving bounded equivalence of complex SQL queries. VeriEQL is based on a new logical encoding that models query semantics over symbolic tuples using the theory of integers with uninterpreted functions. It is simple yet highly practical -- our comprehensive evaluation on over 20,000 benchmarks shows that VeriEQL outperforms all state-of-the-art techniques by more than one order of magnitude in terms of the number of benchmarks that can be proved or disproved. VeriEQL can also generate counterexamples that facilitate many downstream tasks (such as finding serious bugs in systems like MySQL and Apache Calcite).

VeriEQL: Bounded Equivalence Verification for Complex SQL Queries with Integrity Constraints

TL;DR

VeriEQL tackles the practical problem of bounded SQL query equivalence under integrity constraints by introducing an SMT-based encoding that models query semantics with symbolic tuples over integers and uninterpreted functions. It formalizes a larger, real-world fragment of SQL (including ORDER BY, CASE, WITH, and complex constraints) and proves correctness with respect to a three-valued, NULL-aware semantics. The approach builds a symbolic database under a bound, encodes integrity constraints and query operators into a single SMT formula, and uses SAT/SMT solving to either certify bounded equivalence or produce counterexamples. Empirically, VeriEQL dramatically outperforms state-of-the-art baselines across 24,455 benchmark pairs, proving or disproving significantly more cases and revealing real bugs in systems like MySQL and Calcite. The work demonstrates the practical impact of formal equivalence checking for real-world SQL, including automated debugging, test generation, and validation of rewrite rules.

Abstract

The task of SQL query equivalence checking is important in various real-world applications (including query rewriting and automated grading) that involve complex queries with integrity constraints; yet, state-of-the-art techniques are very limited in their capability of reasoning about complex features (e.g., those that involve sorting, case statement, rich integrity constraints, etc.) in real-life queries. To the best of our knowledge, we propose the first SMT-based approach and its implementation, VeriEQL, capable of proving and disproving bounded equivalence of complex SQL queries. VeriEQL is based on a new logical encoding that models query semantics over symbolic tuples using the theory of integers with uninterpreted functions. It is simple yet highly practical -- our comprehensive evaluation on over 20,000 benchmarks shows that VeriEQL outperforms all state-of-the-art techniques by more than one order of magnitude in terms of the number of benchmarks that can be proved or disproved. VeriEQL can also generate counterexamples that facilitate many downstream tasks (such as finding serious bugs in systems like MySQL and Apache Calcite).
Paper Structure (32 sections, 10 theorems, 39 equations, 24 figures, 1 table, 1 algorithm)

This paper contains 32 sections, 10 theorems, 39 equations, 24 figures, 1 table, 1 algorithm.

Key Result

lemma 1

The proof of all lemmas and theorems can be found in Appendix sec:proof. Given a symbolic database $\Gamma$ and an integrity constraint $\mathcal{C}$, consider a formula $\Phi$ such that $\Gamma \vdash \mathcal{C} \rightsquigarrow \Phi$. If $\Phi$ is satisfiable, then the model of $\Phi$ corresponds

Figures (24)

  • Figure 1: Schematic workflow of $\textsc{VeriEQL}\xspace$.
  • Figure 2: Illustration of how $\textsc{VeriEQL}\xspace$ works on a simple LeetCode task.
  • Figure 3: Relational schema and database.
  • Figure 4: Syntax of SQL Queries. Values include integers, bools, and Null. $id(A)$ is a construct denoting attribute $A$ itself occurs in the attribute list. If the context is clear, we may also omit the $id$ constructor for brevity. The $\text{Cast}$ function takes as input a predicate $\phi$ and returns Null if $\phi$ is $\text{Null}\xspace$, 1 if $\phi$ is $\top$, and 0 otherwise.
  • Figure 5: Formal semantics of SQL queries. ite is the standard if-then-else function. $\textsf{hasAgg}(L)$ checks if the attribute list $L$ has an attribute computing an aggregation. $\textsf{merge}\xspace(x, y)$ merges two tuples $x$ and $y$ into one tuple. $\textsf{rename}\xspace(R, n)$ replaces all occurrences of the original table name $R$ with the new globally unique name $n$. $T_{\text{Null}\xspace}$ represents a tuple of Null's, whose length is determined as appropriate by the context for brevity. Note that $xs - x$ on the right-hand side of $\llbracket Q_1 - Q_2 \rrbracket_\mathcal{D}$ denotes deleting one tuple $x$ in $xs$ iff $xs \in x$. Full formal semantics of all constructs in our query language, including predicates $\llbracket \phi \rrbracket_{\mathcal{D}, xs}$, attribute lists $\llbracket L \rrbracket_{\mathcal{D}, xs}$, and expressions $\llbracket e \rrbracket_{\mathcal{D}, xs}$ is available in Appendix \ref{['sec:full-semantics']} .
  • ...and 19 more figures

Theorems & Definitions (14)

  • definition 1: Conformance between Database and Schema
  • definition 2: Bounded Equivalence modulo Integrity Constraint
  • lemma 1
  • definition 3: Interpretation
  • definition 4: Interpretation Extension
  • theorem 1
  • theorem 2
  • lemma 2
  • theorem 3
  • lemma 3
  • ...and 4 more