Archer: A Human-Labeled Text-to-SQL Dataset with Arithmetic, Commonsense and Hypothetical Reasoning

Danna Zheng; Mirella Lapata; Jeff Z. Pan

TL;DR

Archer is a bilingual English–Chinese text-to-SQL dataset designed to probe arithmetic, commonsense, and hypothetical reasoning across 20 databases spanning 20 domains. The corpus is built through a six-stage annotation pipeline (database selection, bilingual question construction, SQL annotation, reviews, paraphrasing, and final validation), yielding 1,042 questions per language and 521 unique SQL queries. Experiments show that current LLMs and vanilla fine-tuned models perform poorly on Archer (e.g., an execution accuracy (EX) of 6.73% with GPT-4+DIN-SQL), underscoring the dataset’s difficulty and the need for stronger reasoning and value-slot handling. Archer’s combination of long inputs, complex SQL grammar, and diverse reasoning types leaves substantial room for progress in cross-domain text-to-SQL systems, especially in bilingual and knowledge-integrated settings.

Abstract

We present Archer, a challenging bilingual text-to-SQL dataset specific to complex reasoning, including arithmetic, commonsense and hypothetical reasoning. It contains 1,042 English questions and 1,042 Chinese questions, along with 521 unique SQL queries, covering 20 English databases across 20 domains. Notably, this dataset demonstrates a significantly higher level of complexity compared to existing publicly available datasets. Our evaluation shows that Archer challenges the capabilities of current state-of-the-art models, with a high-ranked model on the Spider leaderboard achieving only 6.73% execution accuracy on the Archer test set. Thus, Archer presents a significant challenge for future research in this field.
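
The execution accuracy figure quoted above is the share of questions for which the predicted SQL returns the same result as the gold SQL when both are run against the database. The following is a minimal sketch of that idea, not the authors' evaluation script: the toy `product` table, the example queries, and the helper names `run_query` and `execution_accuracy` are all assumptions made for illustration, and result rows are compared as unordered collections, which ignores any ORDER BY requirement.

```python
import sqlite3


def run_query(conn, sql):
    """Return the result rows in a canonical order, or None if the query fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    # Sort by repr so that row order does not affect the comparison (a simplification).
    return sorted(rows, key=repr)


def execution_accuracy(conn, pairs):
    """Fraction of (predicted_sql, gold_sql) pairs whose execution results match."""
    if not pairs:
        return 0.0
    hits = 0
    for predicted_sql, gold_sql in pairs:
        gold = run_query(conn, gold_sql)
        predicted = run_query(conn, predicted_sql)
        if gold is not None and predicted == gold:
            hits += 1
    return hits / len(pairs)


# Hypothetical toy database, used only to exercise the functions above.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE product (name TEXT, price REAL, cost REAL);
    INSERT INTO product VALUES ('a', 10.0, 4.0), ('b', 20.0, 15.0);
    """
)

pairs = [
    # Different SQL text, same result: counted as correct.
    ("SELECT name FROM product WHERE price - cost > 5",
     "SELECT name FROM product WHERE price - cost > 5.0"),
    # Executes, but returns extra rows: counted as wrong.
    ("SELECT name FROM product",
     "SELECT name FROM product WHERE price - cost > 5"),
    # Fails to execute (misspelled column): counted as wrong.
    ("SELECT nme FROM product",
     "SELECT name FROM product"),
]

score = execution_accuracy(conn, pairs)
print(f"EX = {score:.2%}")
```

Comparing execution results rather than SQL strings means that two differently written but equivalent queries both count as correct, which matters for arithmetic questions where the same computation can be expressed in several ways.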

Paper Structure (53 sections, 1 equation, 8 figures, 6 tables, 1 algorithm)

Introduction
Reasoning Types
Arithmetic reasoning
Commonsense reasoning
Hypothetical reasoning
Corpus Construction
Database Collection
Question Annotation
1) Arithmetic Reasoning:
2) Hypothetical Reasoning:
3) Commonsense Reasoning:
4) Complex SQL Grammar:
SQL Annotation
1) Clarity Ensuring:
2) SQL Writing:
...and 38 more sections

Figures (8)

Figure 1: Archer examples with three reasoning types: arithmetic, commonsense, and hypothetical reasoning. (See more examples in the appendix.)
Figure 2: The annotation process for Archer.
Figure 3: GPT-3.5 + CT-3 execution accuracy comparison across and within different reasoning types. A refers to arithmetic, H to hypothetical, and C to commonsense.
Figure 4: GPT-3.5 + CT-3 execution accuracy with respect to different complexity levels. The abbreviations are as follows: QL for the average question length (1: [0,15), 2: [15,20), 3: [30,45), 4: [45,∞)), SQLL for the average SQL length (1: [0,50), 2: [50,100), 3: [100,150), 4: [150,∞)), VS for the average number of value slots per question (1: [0,3), 2: [3,6), 3: [6,9), 4: [9,∞)), TM for the average number of tables mentioned in each SQL (1: [0,2), 2: [2,3), 3: [3,5), 4: [5,∞)), NL for the average nested level per SQL (1: [0,1), 2: [1,2), 3: [2,3), 4: [3,∞)).
Figure 5: Examples of the API Doc prompt, the CT-3 prompt, and the CT-3+COT prompt.
...and 3 more figures
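
The Figure 4 caption groups each complexity measure into four levels using half-open intervals. A minimal sketch of that bucketing is shown below, using the SQL length (SQLL) boundaries from the caption; the function name `complexity_level` and the sample values are assumptions for illustration, not part of the paper.

```python
from bisect import bisect_right

# Upper boundaries of levels 1-3 for SQL length, taken from the Figure 4 caption:
# 1: [0,50), 2: [50,100), 3: [100,150), 4: [150, inf)
SQL_LENGTH_EDGES = [50, 100, 150]


def complexity_level(value, edges):
    """Map a value to its 1-based level, treating each interval as [low, high)."""
    # bisect_right counts how many boundaries are <= value, so a value that
    # sits exactly on a boundary falls into the higher level.
    return bisect_right(edges, value) + 1


# Hypothetical SQL lengths, chosen to hit each level and the boundaries.
sample_lengths = (0, 49, 50, 120, 150, 400)
levels = [complexity_level(v, SQL_LENGTH_EDGES) for v in sample_lengths]
print(levels)
```

The same function applies to the other measures in the caption by passing a different list of boundaries.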
