Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness
Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, Steve Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Bing Xiang
TL;DR
Dr.Spider is a diagnostic robustness benchmark for text-to-SQL that extends the Spider dataset with 17 task-specific perturbations across databases, natural language questions (NLQs), and SQL queries, totaling about 15K perturbed examples. The authors combine crowdsourcing, expert analysis, and large pre-trained language models to generate diverse NLQ perturbations, and they generate SQL perturbations programmatically, which enables fine-grained evaluation of state-of-the-art models on each input modality. Across a broad set of models (including RatSQL, GraPPa, SmBop, various T5 variants, Picard, and Codex in-context), the benchmark reveals meaningful robustness gaps: DB and NLQ perturbations drive the largest performance drops, while SQL perturbations generally cause smaller ones. The work also analyzes design choices such as model size, decoders, and entity linking, offering practical guidance for building more robust text-to-SQL systems and outlining future data-augmentation and hybrid-decoder strategies. Overall, Dr.Spider provides a scalable, linguistically rich, cross-domain framework for quantifying and improving text-to-SQL robustness in realistic settings.
Abstract
Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries. However, recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations. Previously curated robustness test sets usually focus on individual phenomena. In this paper, we propose a comprehensive robustness benchmark based on Spider, a cross-domain text-to-SQL benchmark, to diagnose model robustness. We design 17 perturbations on databases, natural language questions, and SQL queries to measure robustness from different angles. To collect more diverse natural language question perturbations, we use large pretrained language models (PLMs) to simulate human behavior in creating natural questions. We conduct a diagnostic study of state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis of text-to-SQL model designs and provide insights for improving model robustness.
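As an illustration of how a per-perturbation performance drop like the ones quoted above could be tabulated, the following is a minimal sketch, not the authors' evaluation code. It assumes each example is scored for correctness once on the original input and once on its perturbed counterpart, and that the drop is the absolute difference in accuracy in percentage points; the abstract does not specify which definition the reported figures use. The function name `accuracy_drop`, the tuple layout, and the toy data are all hypothetical.

```python
from collections import defaultdict


def accuracy_drop(results):
    """Summarize accuracy before and after perturbation.

    `results` is an iterable of (perturbation, pre_correct, post_correct)
    tuples (a hypothetical layout chosen for this sketch). Returns a dict
    mapping each perturbation name to
    (pre_accuracy, post_accuracy, absolute_drop), all in percentage points.
    """
    # Per perturbation: [total examples, correct before, correct after]
    counts = defaultdict(lambda: [0, 0, 0])
    for name, pre_ok, post_ok in results:
        entry = counts[name]
        entry[0] += 1
        entry[1] += int(pre_ok)
        entry[2] += int(post_ok)

    report = {}
    for name, (total, pre, post) in counts.items():
        pre_acc = 100.0 * pre / total
        post_acc = 100.0 * post / total
        # Assumption: "drop" means pre-accuracy minus post-accuracy.
        report[name] = (pre_acc, post_acc, pre_acc - post_acc)
    return report


# Toy data with illustrative perturbation names; not results from the paper.
toy_results = [
    ("schema-synonym", True, False),
    ("schema-synonym", True, True),
    ("schema-synonym", True, False),
    ("schema-synonym", False, False),
    ("keyword-synonym", True, True),
    ("keyword-synonym", True, False),
]

report = accuracy_drop(toy_results)
for name, (pre_acc, post_acc, drop) in sorted(report.items()):
    print(f"{name}: pre={pre_acc:.1f} post={post_acc:.1f} drop={drop:.1f}")
```

Grouping by perturbation keeps the diagnosis fine-grained: a single overall number can hide the fact that one perturbation type accounts for most of the loss.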
