SQLSpace: A Representation Space for Text-to-SQL to Discover and Mitigate Robustness Gaps
Neha Srikanth, Victor Bursztyn, Puneet Mathur, Ani Nenkova
TL;DR
SQLSpace provides a compact, human-interpretable representation for NL2SQL examples by extracting 187 binary predicates from five aspect-based descriptions, enabling fine-grained benchmarking and model analysis beyond aggregate accuracy. The framework combines a semi-automatic description-generation step, predicate discovery, deduplication, and binary vector construction. The resulting representations reveal benchmark composition and model blind spots, and point to the potential for correctness-guided query rewriting. Through analyses of Spider-Dev, Bird-Dev, and Spider-Realistic, the work shows how cluster-level and dataset-level insights can inform model selection, robustness benchmarking, and targeted data augmentation; it also discusses practical cost considerations and the potential for online rewriting via a correctness estimator. More broadly, it highlights how interpretable representations can uncover systematic weaknesses, guide efficient model deployment, and motivate future improvements in NL2SQL robustness research and benchmark design.
Abstract
We introduce SQLSpace, a human-interpretable, generalizable, compact representation for text-to-SQL examples derived with minimal human intervention. We demonstrate the utility of these representations in evaluation with three use cases: (i) closely comparing and contrasting the composition of popular text-to-SQL benchmarks to identify unique dimensions of examples they evaluate, (ii) understanding model performance at a granular level beyond overall accuracy scores, and (iii) improving model performance through targeted query rewriting based on learned correctness estimation. We show that SQLSpace enables analysis that would be difficult with raw examples alone: it reveals compositional differences between benchmarks, exposes performance patterns obscured by accuracy alone, and supports modeling of query success.
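To make the idea of a binary predicate representation concrete, the following is a minimal illustrative sketch, not the authors' implementation. It assumes each example has already been tagged with the set of predicates that hold for it. The predicate names, `to_binary_vector`, and `accuracy_by_predicate` are hypothetical and introduced here only for illustration; they do not come from the paper.

```python
from typing import Dict, List, Set

# Hypothetical predicate names for illustration only; the actual
# predicates used by SQLSpace are not listed in this section.
PREDICATES: List[str] = [
    "uses_join",
    "has_aggregation",
    "has_nested_subquery",
    "question_mentions_column_name",
]


def to_binary_vector(active: Set[str], predicates: List[str] = PREDICATES) -> List[int]:
    """Encode one example as a 0/1 vector, one position per predicate."""
    return [1 if name in active else 0 for name in predicates]


def accuracy_by_predicate(
    vectors: List[List[int]],
    correct: List[bool],
    predicates: List[str] = PREDICATES,
) -> Dict[str, float]:
    """Compute model accuracy over the examples where each predicate holds.

    Predicates that hold for no example are omitted from the result.
    """
    result: Dict[str, float] = {}
    for i, name in enumerate(predicates):
        outcomes = [ok for vec, ok in zip(vectors, correct) if vec[i] == 1]
        if outcomes:
            result[name] = sum(outcomes) / len(outcomes)
    return result


# Toy data: three examples and whether a (hypothetical) model got each right.
vectors = [
    to_binary_vector({"uses_join"}),
    to_binary_vector({"uses_join", "has_aggregation"}),
    to_binary_vector({"has_nested_subquery"}),
]
correct = [True, False, True]
per_predicate = accuracy_by_predicate(vectors, correct)
```

Breaking accuracy down per predicate in this way is one simple route to the kind of granular view that an overall accuracy score hides; the paper's own analyses operate at the cluster and dataset level.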
