Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning

Danna Zheng, Mirella Lapata, Jeff Z. Pan

TL;DR

Archer is a bilingual English-Chinese text-to-SQL dataset designed to probe arithmetic, commonsense, and hypothetical reasoning across 20 databases spanning 20 domains. The corpus is built through a six-stage annotation pipeline (database selection, bilingual question construction, SQL annotation, review, paraphrasing, and final validation), yielding 1,042 questions per language and 521 unique SQL queries. Current LLMs and vanilla fine-tuned models perform poorly on Archer (for example, GPT-4 + DIN-SQL reaches an execution accuracy, EX, of only 6.73%), which underscores the dataset's difficulty and the need for stronger reasoning and value-slot handling. Archer's combination of long inputs, complex SQL grammar, and diverse reasoning types leaves substantial room for progress in cross-domain text-to-SQL systems, especially in bilingual and knowledge-integrated settings.

Abstract

We present Archer, a challenging bilingual text-to-SQL dataset focused on complex reasoning, including arithmetic, commonsense, and hypothetical reasoning. It contains 1,042 English questions and 1,042 Chinese questions, along with 521 unique SQL queries, covering 20 English databases across 20 domains. Notably, this dataset demonstrates a significantly higher level of complexity than existing publicly available datasets. Our evaluation shows that Archer challenges the capabilities of current state-of-the-art models, with a high-ranked model on the Spider leaderboard achieving only 6.73% execution accuracy on the Archer test set. Thus, Archer presents a significant challenge for future research in this field.
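The execution accuracy figure quoted above can be made concrete with a small sketch. The code below is an illustration only, not the paper's evaluation script: it assumes a SQLite database, treats a prediction as correct when its result rows match the gold query's rows as an unordered multiset, and counts any query that fails to execute as wrong. The table `item`, the helper names `execution_match` and `execution_accuracy`, and the toy query pairs are all hypothetical.

```python
import sqlite3
from collections import Counter


def execution_match(conn, predicted_sql, gold_sql):
    """Return True if both queries yield the same rows (order ignored).

    Assumption: rows are compared as an unordered multiset; the official
    Archer evaluation may treat ordering or value formatting differently.
    """
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A prediction that cannot be executed counts as a miss.
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    return Counter(predicted_rows) == Counter(gold_rows)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted, gold) pairs whose execution results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)


# Toy database and query pairs (hypothetical, for illustration only).
conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE item (name TEXT, price REAL)")
conn.executemany(
    "INSERT INTO item VALUES (?, ?)", [("pen", 2.0), ("book", 10.0)]
)

pairs = [
    # Different SQL text, same result rows: counted as correct.
    ("SELECT name FROM item WHERE price > 5",
     "SELECT name FROM item WHERE price > 5.0"),
    # Missing the arithmetic step, so the rows differ: counted as wrong.
    ("SELECT price FROM item", "SELECT price * 2 FROM item"),
    # Misspelled column, so the query fails to execute: counted as wrong.
    ("SELECT nme FROM item", "SELECT name FROM item"),
]

accuracy = execution_accuracy(conn, pairs)
print(f"EX = {accuracy:.2%}")
```

Comparing executed results rather than SQL strings means two differently written queries can both be judged correct, which matters for a dataset where arithmetic expressions can be phrased in many equivalent ways.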

Paper Structure (53 sections, 1 equation, 8 figures, 6 tables, 1 algorithm)

This paper contains 53 sections, 1 equation, 8 figures, 6 tables, and 1 algorithm.

Figures (8)

  • Figure 1: Archer examples with three reasoning types: arithmetic, commonsense, and hypothetical reasoning. (More examples appear in the appendix.)
  • Figure 2: The annotation process for Archer.
  • Figure 3: GPT-3.5 + CT-3 execution accuracy comparison across and within different reasoning types. A refers to arithmetic, H to hypothetical, and C to commonsense.
  • Figure 4: GPT-3.5 + CT-3 execution accuracy with respect to different complexity levels. Abbreviations: QL is the average question length (1: [0,15), 2: [15,20), 3: [30,45), 4: [45,∞)); SQLL is the average SQL length (1: [0,50), 2: [50,100), 3: [100,150), 4: [150,∞)); VS is the average number of value slots per question (1: [0,3), 2: [3,6), 3: [6,9), 4: [9,∞)); TM is the average number of tables mentioned per SQL query (1: [0,2), 2: [2,3), 3: [3,5), 4: [5,∞)); NL is the average nesting level per SQL query (1: [0,1), 2: [1,2), 3: [2,3), 4: [3,∞)).
  • Figure 5: Examples of the API Doc prompt, the CT-3 prompt, and the CT-3+COT prompt.
  • ...and 3 more figures
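
The complexity levels in the Figure 4 caption are half-open intervals, so mapping a measurement to its level is a simple boundary lookup. The sketch below is an illustration under stated assumptions, not code from the paper: it uses the SQLL boundaries from the caption (50, 100, 150), and the helper name `complexity_level` and the constant `SQL_LENGTH_EDGES` are hypothetical. The caption does not state the unit of SQL length, so the inputs here are plain numbers.

```python
from bisect import bisect_right

# Upper boundaries of levels 1-3 for SQL length, taken from the caption:
# 1: [0,50), 2: [50,100), 3: [100,150), 4: [150, infinity).
SQL_LENGTH_EDGES = [50, 100, 150]


def complexity_level(value, edges):
    """Map a non-negative measurement to a 1-based level.

    bisect_right counts how many boundaries are <= value, which matches
    half-open intervals: a value equal to a boundary falls in the next level.
    """
    return bisect_right(edges, value) + 1


for sql_length in (12, 50, 149, 150, 400):
    print(sql_length, "->", complexity_level(sql_length, SQL_LENGTH_EDGES))
```

The same helper works for the other measures in the caption by passing a different boundary list.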