Evaluating LLMs for Text-to-SQL Generation With Complex SQL Workload
Limin Ma, Ken Pu, Ying Zhu
TL;DR
The study benchmarks text-to-SQL generation against the complex TPC-DS workload, comparing it to BIRD and Spider to quantify structural complexity. It defines bag-valued and numeric SQL features to measure query structure, then evaluates 11 large language models on generating correct queries from TPC-DS descriptions with schema-aware prompts. Results show that TPC-DS queries are significantly more complex and that current LLMs fail to produce accurate decision-making queries, even after retries. The work underscores the gap between existing benchmarks and real-world complexity and proposes concrete directions for progress: incremental generation, targeted prompting, model fine-tuning, and human-in-the-loop strategies.
Abstract
This study presents a comparative analysis of a complex SQL benchmark, TPC-DS, with two existing text-to-SQL benchmarks, BIRD and Spider. Our findings reveal that TPC-DS queries exhibit a significantly higher level of structural complexity than those of the other two benchmarks. This underscores the need for more intricate benchmarks to simulate realistic scenarios effectively. To facilitate this comparison, we devised several measures of structural complexity and applied them across all three benchmarks. The results of this study can guide future research in the development of more sophisticated text-to-SQL benchmarks. We used 11 distinct large language models (LLMs) to generate SQL queries from the query descriptions provided by the TPC-DS benchmark. Each prompt incorporated both the query description given in the TPC-DS specification and the TPC-DS database schema. We compared the generated queries with the TPC-DS gold-standard queries using a series of fuzzy structure matching techniques based on query features. Our findings indicate that current state-of-the-art generative AI models fall short in generating accurate decision-making queries, and that the accuracy of the generated queries is insufficient for practical real-world application.
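As an illustration of what feature-based fuzzy structure matching can look like, the following minimal sketch extracts one bag-valued feature (the multiset of SQL keywords in a query) and scores a generated query against a gold query with multiset Jaccard similarity. This is not the paper's implementation: the keyword list, the function names `keyword_bag` and `bag_jaccard`, the choice of Jaccard similarity, and the two example queries are all assumptions made for illustration.

```python
import re
from collections import Counter

# Hypothetical keyword list, chosen only for this illustration.
KEYWORDS = {"select", "from", "where", "join", "group",
            "order", "having", "union", "with"}


def keyword_bag(sql: str) -> Counter:
    """Return the bag (multiset) of SQL keywords found in the query text."""
    tokens = re.findall(r"[a-z_]+", sql.lower())
    return Counter(t for t in tokens if t in KEYWORDS)


def bag_jaccard(a: Counter, b: Counter) -> float:
    """Multiset Jaccard similarity: |a & b| / |a | b|; 1.0 if both are empty."""
    union_size = sum((a | b).values())
    if union_size == 0:
        return 1.0
    return sum((a & b).values()) / union_size


# Made-up queries standing in for a gold query and an LLM-generated query.
gold = ("SELECT s.id, SUM(p.amt) FROM sales s JOIN pay p ON s.id = p.sid "
        "WHERE p.amt > 0 GROUP BY s.id")
generated = "SELECT id FROM sales WHERE id > 0"

score = bag_jaccard(keyword_bag(gold), keyword_bag(generated))
print(f"keyword-bag similarity: {score:.2f}")
```

Here the generated query shares three of the gold query's five keywords, so a score below 1 signals a structural mismatch (the missing JOIN and GROUP BY) without requiring exact string equality. A real evaluation would likely rely on a proper SQL parser rather than regular-expression tokenization.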
