Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

Bin Zhang; Yuxiao Ye; Guoqing Du; Xiaoru Hu; Zhishuai Li; Sun Yang; Chi Harold Liu; Rui Zhao; Ziyue Li; Hangyu Mao

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

Bin Zhang, Yuxiao Ye, Guoqing Du, Xiaoru Hu, Zhishuai Li, Sun Yang, Chi Harold Liu, Rui Zhao, Ziyue Li, Hangyu Mao

TL;DR

This work tackles the lack of systematic benchmarking for Text-to-SQL by introducing a granular five-task evaluation framework and a data-augmentation dataset (BigTable-0.2k) to mitigate LLM overfitting. It systematically compares general-purpose and coding-specific LLMs across Text2SQL, debugging, optimization, SQL-to-Text, and schema linking, revealing strong model- and prompt-dependent performance with clear gaps in sub-task analyses. Key findings show that carefully designed prompts (e.g., SimpleDDL-MD-Chat) yield robust Text-to-SQL results, self-debugging and direct optimization strategies have nuanced effects, and schema linking benefits from targeted prompting (PreSQL) and foreign-key information. The study provides actionable insights into prompt engineering, task decomposition, and evaluation metrics to improve reliability and efficiency of LLM-based Text-to-SQL systems in real-world applications.

Abstract

Large Language Models (LLMs) have emerged as a powerful tool in advancing the Text-to-SQL task, significantly outperforming traditional methods. Nevertheless, as a nascent research field, there is still no consensus on the optimal prompt templates and design frameworks. Additionally, existing benchmarks inadequately explore the performance of LLMs across the various sub-tasks of the Text-to-SQL process, which hinders the assessment of LLMs' cognitive capabilities and the optimization of LLM-based solutions. To address the aforementioned issues, we firstly construct a new dataset designed to mitigate the risk of overfitting in LLMs. Then we formulate five evaluation tasks to comprehensively assess the performance of diverse methods across various LLMs throughout the Text-to-SQL process.Our study highlights the performance disparities among LLMs and proposes optimal in-context learning solutions tailored to each task. These findings offer valuable insights for enhancing the development of LLM-based Text-to-SQL systems.

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

TL;DR

Abstract

Paper Structure (36 sections, 6 equations, 6 figures, 14 tables)

This paper contains 36 sections, 6 equations, 6 figures, 14 tables.

Introduction
Related Work
Traditional Learning-based Text-to-SQL Methods
LLM-based Text-to-SQL Methods
Datasets and Evaluation Metrics
Settings
Task Formulation
Evaluation Models
Dataset Construction
Evaluation
Text2SQL
Zero-shot Prompting Optimization.
End-to-End Text2SQL evaluation.
SQL Debugging
Debugging Dataset
...and 21 more sections

Figures (6)

Figure 1: Benchmarking tasks in Text-to-SQL pipeline.
Figure 2: Overall performance of different LLMs in various sub-tasks.
Figure 3: Word cloud representation of error information for incorrect SQL queries generated by LLMs. Top: System Error and Result Error. Bottom: Detailed classification of Result Error.
Figure 4: EX (%) improvement brought by self debug.
Figure 5: Left: EX (%) improvement brought by multi-round self debug. Right: Numbers of wrong SQL queries of detailed error type during the process of multi-round self debug.
...and 1 more figures

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

TL;DR

Abstract

Benchmarking the Text-to-SQL Capability of Large Language Models: A Comprehensive Evaluation

Authors

TL;DR

Abstract

Table of Contents

Figures (6)