Dr.Spider: A Diagnostic Evaluation Benchmark towards Text-to-SQL Robustness

Shuaichen Chang, Jun Wang, Mingwen Dong, Lin Pan, Henghui Zhu, Alexander Hanbo Li, Wuwei Lan, Sheng Zhang, Jiarong Jiang, Joseph Lilien, Steve Ash, William Yang Wang, Zhiguo Wang, Vittorio Castelli, Patrick Ng, Bing Xiang

TL;DR

Dr.Spider is a diagnostic robustness benchmark for text-to-SQL that extends the Spider dataset with 17 task-specific perturbations of databases (DB), natural language questions (NLQ), and SQL queries, totaling about 15K perturbed examples. The authors combine crowdsourcing, expert analysis, and large pre-trained language models to generate diverse NLQ perturbations, and they generate SQL perturbations programmatically, which allows fine-grained evaluation of state-of-the-art models on each perturbed component. Across a broad set of models (including RAT-SQL, GraPPa, SmBoP, several T5 variants, Picard, and Codex with in-context learning), the benchmark reveals substantial robustness gaps: DB and NLQ perturbations cause the largest performance drops, while SQL perturbations are generally less damaging. The work also analyzes design choices such as model size, decoder type, and entity linking, offers practical guidance for building more robust text-to-SQL systems, and outlines future data-augmentation and hybrid-decoder strategies. Overall, Dr.Spider provides a scalable, linguistically rich, cross-domain framework for quantifying and improving text-to-SQL robustness in realistic settings.

Abstract

Neural text-to-SQL models have achieved remarkable performance in translating natural language questions into SQL queries. However, recent studies reveal that text-to-SQL models are vulnerable to task-specific perturbations. Previous curated robustness test sets usually focus on individual phenomena. In this paper, we propose a comprehensive robustness benchmark based on Spider, a cross-domain text-to-SQL benchmark, to diagnose the model robustness. We design 17 perturbations on databases, natural language questions, and SQL queries to measure the robustness from different angles. In order to collect more diversified natural question perturbations, we utilize large pretrained language models (PLMs) to simulate human behaviors in creating natural questions. We conduct a diagnostic study of the state-of-the-art models on the robustness set. Experimental results reveal that even the most robust model suffers from a 14.0% performance drop overall and a 50.7% performance drop on the most challenging perturbation. We also present a breakdown analysis regarding text-to-SQL model designs and provide insights for improving model robustness.
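
The pre- versus post-perturbation comparison behind figures such as these can be illustrated with a minimal sketch. Everything named here is hypothetical and not taken from the benchmark's code: the `PairedResult` record, the `robustness_metrics` function, and the toy data. The sketch also assumes that relative robustness accuracy means the share of examples answered correctly before perturbation that remain correct afterwards, and it reports both an absolute and a relative drop because the abstract does not say which convention its percentages use.

```python
from dataclasses import dataclass


@dataclass
class PairedResult:
    """Correctness of one example before and after perturbation (hypothetical record)."""
    pre_correct: bool
    post_correct: bool


def robustness_metrics(results):
    """Compute accuracy before/after perturbation and the resulting drops.

    Assumption: "relative robustness accuracy" is the fraction of examples
    that were correct pre-perturbation and are still correct post-perturbation.
    """
    n = len(results)
    if n == 0:
        raise ValueError("need at least one paired result")

    pre_acc = sum(r.pre_correct for r in results) / n
    post_acc = sum(r.post_correct for r in results) / n

    # Restrict to examples the model solved before the perturbation.
    pre_ok = [r for r in results if r.pre_correct]
    if pre_ok:
        rel_robust = sum(r.post_correct for r in pre_ok) / len(pre_ok)
    else:
        rel_robust = float("nan")

    return {
        "pre_acc": pre_acc,
        "post_acc": post_acc,
        "absolute_drop": pre_acc - post_acc,
        # Drop expressed as a fraction of the pre-perturbation accuracy.
        "relative_drop": (pre_acc - post_acc) / pre_acc if pre_acc else float("nan"),
        "relative_robustness_acc": rel_robust,
    }


# Toy data: four paired examples, invented purely for illustration.
toy = [
    PairedResult(True, True),
    PairedResult(True, False),
    PairedResult(False, False),
    PairedResult(True, True),
]
metrics = robustness_metrics(toy)
print(metrics["pre_acc"], metrics["post_acc"])  # 0.75 0.5
```

On the toy data, three of four examples are correct before perturbation and two of those three stay correct, so the relative robustness accuracy is 2/3 while the absolute accuracy drop is 0.25.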

Paper Structure (33 sections, 7 figures, 16 tables)

Figures (7)

  • Figure 1: An example of the SOTA model Picard (Scholak et al., 2021) against DB, NLQ, and SQL perturbations on the database WTA. Picard predicts a correct SQL on pre-perturbation data but fails on post-perturbation data. The blue and gray areas highlight the modifications to the input and the errors in the predicted SQL, respectively.
  • Figure 2: Pre-perturbation, post-perturbation, and relative robustness accuracy of T5-base, T5-large, and T5-3B in terms of EX.
  • Figure 3: EM accuracy of GraPPa and SmBoP on pre-perturbation and post-perturbation data of DB, NLQ, and SQL.
  • Figure 4: The instructions for crowdsourcing paraphrase collection on Amazon Mechanical Turk.
  • Figure 5: The interface for crowdsourcing paraphrase collection on Amazon Mechanical Turk. Annotators are given a free text box to input their paraphrased question.
  • ...and 2 more figures