Table of Contents
Fetching ...

SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification

Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu

TL;DR

SpotIt introduces a verification-based evaluation pipeline for Text-to-SQL that uses bounded equivalence checking to actively seek databases differentiating generated and gold SQL queries, addressing the shortcomings of test-based evaluation. By extending VeriEQL to support a richer SQL subset with strings, dates, and type conversions, SpotIt provides rigorous correctness guarantees and uncovers semantic differences that static test databases miss. Experimental results on the BIRD dataset show substantial drops in reported accuracy and shifts in method rankings when evaluated with SpotIt, as well as diagnostic counterexamples that expose gold-SQL issues and NL ambiguities. The work highlights the need to re-evaluate benchmark reliability and suggests concrete steps toward more robust, user-informed evaluation frameworks for Text-to-SQL.

Abstract

Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.

SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification

TL;DR

SpotIt introduces a verification-based evaluation pipeline for Text-to-SQL that uses bounded equivalence checking to actively seek databases differentiating generated and gold SQL queries, addressing the shortcomings of test-based evaluation. By extending VeriEQL to support a richer SQL subset with strings, dates, and type conversions, SpotIt provides rigorous correctness guarantees and uncovers semantic differences that static test databases miss. Experimental results on the BIRD dataset show substantial drops in reported accuracy and shifts in method rankings when evaluated with SpotIt, as well as diagnostic counterexamples that expose gold-SQL issues and NL ambiguities. The work highlights the need to re-evaluate benchmark reliability and suggests concrete steps toward more robust, user-informed evaluation frameworks for Text-to-SQL.

Abstract

Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.

Paper Structure

This paper contains 16 sections, 7 theorems, 14 equations, 19 figures, 26 tables, 2 algorithms.

Key Result

Theorem 1

Let $D\xspace$ be a database over schema $\mathcal{S}$, $xs$ be a tuple list, and $E$ be an expression. Consider a symbolic database $\Gamma$ over $\mathcal{S}$, a list of symbolic tuples $\mathcal{T}$, and $E$'s symbolic encoding $\llbracket E \rrbracket_{\mathcal{S}, \Gamma, \mathcal{T}}$. For any

Figures (19)

  • Figure 1: Examples of cases where the generated SQL produces the same output as the gold SQL on the BIRD's official test database, but SpotIt finds a database that differentiates the the queries. The parts that explain the mismatch are highlighted. For $N\xspace_{1}$, the gold SQL is incorrect. And for $N\xspace_{2}$, both SQL queries can be right depending on the interpretation of the NL question.
  • Figure 2: Extended syntax of SQL Queries. New features are in bold.
  • Figure 3: Three main phases of $\textsc{SpotIt}\xspace$.
  • Figure 4: A breakdown of the primary reason for the difference between generated and gold SQLs.
  • Figure 5: A breakdown of questions that passed EX-test but failed SpotIt+.
  • ...and 14 more figures

Theorems & Definitions (22)

  • Example 3.1
  • Example 3.2
  • Example 4.1
  • Theorem 1: Correctness of expression encoding
  • Theorem 2: Equivalence under set semantics
  • Example B.1
  • Example B.2
  • Example B.3
  • Example B.4
  • Example B.5
  • ...and 12 more