SpotIt: Evaluating Text-to-SQL Evaluation with Formal Verification
Rocky Klopfenstein, Yang He, Andrew Tremante, Yuepeng Wang, Nina Narodytska, Haoze Wu
TL;DR
SpotIt introduces a verification-based evaluation pipeline for Text-to-SQL that uses bounded equivalence checking to actively seek databases differentiating generated and gold SQL queries, addressing the shortcomings of test-based evaluation. By extending VeriEQL to support a richer SQL subset with strings, dates, and type conversions, SpotIt provides rigorous correctness guarantees and uncovers semantic differences that static test databases miss. Experimental results on the BIRD dataset show substantial drops in reported accuracy and shifts in method rankings when evaluated with SpotIt, as well as diagnostic counterexamples that expose gold-SQL issues and NL ambiguities. The work highlights the need to re-evaluate benchmark reliability and suggests concrete steps toward more robust, user-informed evaluation frameworks for Text-to-SQL.
Abstract
Community-driven Text-to-SQL evaluation platforms play a pivotal role in tracking the state of the art of Text-to-SQL performance. The reliability of the evaluation process is critical for driving progress in the field. Current evaluation methods are largely test-based, which involves comparing the execution results of a generated SQL query and a human-labeled ground-truth on a static test database. Such an evaluation is optimistic, as two queries can coincidentally produce the same output on the test database while actually being different. In this work, we propose a new alternative evaluation pipeline, called SpotIt, where a formal bounded equivalence verification engine actively searches for a database that differentiates the generated and ground-truth SQL queries. We develop techniques to extend existing verifiers to support a richer SQL subset relevant to Text-to-SQL. A performance evaluation of ten Text-to-SQL methods on the high-profile BIRD dataset suggests that test-based methods can often overlook differences between the generated query and the ground-truth. Further analysis of the verification results reveals a more complex picture of the current Text-to-SQL evaluation.
