Table of Contents
Fetching ...

Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries

Jonathan Fürst, Catherine Kosten, Farhad Nooralahzadeh, Yi Zhang, Kurt Stockinger

TL;DR

This work investigates how Text-to-SQL systems robustly translate natural language queries into SQL under real-world data-model variations. It analyzes a multi-year FootballDB deployment with three data models, using real user queries gathered during the FIFA World Cup 2022 to evaluate small, medium, and large language models, along with training-data and pre-/post-processing effects. Key contributions include a formalization of the data-model design space, the introduction of FootballDB with 1,200 labeled NL-SQL pairs across three schemas, and actionable insights showing that data-model design significantly impacts accuracy for smaller models, while larger LLMs exhibit robustness but face prohibitive inference times. The results offer practical guidance for deploying Text-to-SQL in realistic domains and motivate future work on data-model-aware design and efficiency optimizations, with FootballDB released for community benchmarking.

Abstract

Text-to-SQL systems (also known as NL-to-SQL systems) have become an increasingly popular solution for bridging the gap between user capabilities and SQL-based data access. These systems translate user requests in natural language to valid SQL statements for a specific database. Recent Text-to-SQL systems have benefited from the rapid improvement of transformer-based language models. However, while Text-to-SQL systems that incorporate such models continuously reach new high scores on -- often synthetic -- benchmark datasets, a systematic exploration of their robustness towards different data models in a real-world, realistic scenario is notably missing. This paper provides the first in-depth evaluation of the data model robustness of Text-to-SQL systems in practice based on a multi-year international project focused on Text-to-SQL interfaces. Our evaluation is based on a real-world deployment of FootballDB, a system that was deployed over a 9 month period in the context of the FIFA World Cup 2022, during which about 6K natural language questions were asked and executed. All of our data is based on real user questions that were asked live to the system. We manually labeled and translated a subset of these questions for three different data models. For each data model, we explore the performance of representative Text-to-SQL systems and language models. We further quantify the impact of training data size, pre-, and post-processing steps as well as language model inference time. Our comprehensive evaluation sheds light on the design choices of real-world Text-to-SQL systems and their impact on moving from research prototypes to real deployments. Last, we provide a new benchmark dataset to the community, which is the first to enable the evaluation of different data models for the same dataset and is substantially more challenging than most previous datasets in terms of query complexity.

Evaluating the Data Model Robustness of Text-to-SQL Systems Based on Real User Queries

TL;DR

This work investigates how Text-to-SQL systems robustly translate natural language queries into SQL under real-world data-model variations. It analyzes a multi-year FootballDB deployment with three data models, using real user queries gathered during the FIFA World Cup 2022 to evaluate small, medium, and large language models, along with training-data and pre-/post-processing effects. Key contributions include a formalization of the data-model design space, the introduction of FootballDB with 1,200 labeled NL-SQL pairs across three schemas, and actionable insights showing that data-model design significantly impacts accuracy for smaller models, while larger LLMs exhibit robustness but face prohibitive inference times. The results offer practical guidance for deploying Text-to-SQL in realistic domains and motivate future work on data-model-aware design and efficiency optimizations, with FootballDB released for community benchmarking.

Abstract

Text-to-SQL systems (also known as NL-to-SQL systems) have become an increasingly popular solution for bridging the gap between user capabilities and SQL-based data access. These systems translate user requests in natural language to valid SQL statements for a specific database. Recent Text-to-SQL systems have benefited from the rapid improvement of transformer-based language models. However, while Text-to-SQL systems that incorporate such models continuously reach new high scores on -- often synthetic -- benchmark datasets, a systematic exploration of their robustness towards different data models in a real-world, realistic scenario is notably missing. This paper provides the first in-depth evaluation of the data model robustness of Text-to-SQL systems in practice based on a multi-year international project focused on Text-to-SQL interfaces. Our evaluation is based on a real-world deployment of FootballDB, a system that was deployed over a 9 month period in the context of the FIFA World Cup 2022, during which about 6K natural language questions were asked and executed. All of our data is based on real user questions that were asked live to the system. We manually labeled and translated a subset of these questions for three different data models. For each data model, we explore the performance of representative Text-to-SQL systems and language models. We further quantify the impact of training data size, pre-, and post-processing steps as well as language model inference time. Our comprehensive evaluation sheds light on the design choices of real-world Text-to-SQL systems and their impact on moving from research prototypes to real deployments. Last, we provide a new benchmark dataset to the community, which is the first to enable the evaluation of different data models for the same dataset and is substantially more challenging than most previous datasets in terms of query complexity.
Paper Structure (20 sections, 8 figures, 8 tables)

This paper contains 20 sections, 8 figures, 8 tables.

Figures (8)

  • Figure 1: Design space of deep learning-based Text-to-SQL methods (fine-tuning phase in blue, prediction phase in orange). NL questions, database schema and potentially content serve as input. Pre-processing: Some methods perform input enrichment before the encoded input is fed into a fine-tuned language model. Post-processing: Some methods apply techniques such as intermediate representation (IR) or constrained decoding to improve the quality of the generated SQL query.
  • Figure 2: Implementation of the Text-to-SQL system deployed before, during and after the FIFA World Cup 2022.
  • Figure 3: World Cup Schema Diagram v1. PK = Primary Key. FK = Foreign Key. The tables highlighted in blue are affected by re-designing the data model in various iterations. Note that the 1:n relationships between tables national_team and match as well as between national_team and world_cup contain multiple PK/FK references.
  • Figure 4: Example SQL query in all three data models of the natural language question: What was the score between Germany and Brazil in 2014? Components shown in yellow indicate the unsuccessful interpretation of joins; components shown in blue indicate set operations; components shown in orange and red indicate the multiple instantiations of a table in a SELECT clause.
  • Figure 5: World Cup Schema Diagram Changes v2. Tables highlighted in red are changed with respected to data model version 1 shown in Figure \ref{['fig:db_model_v1']}. Original 1:n relationships containing more than one PK/FK references are remodeled to only contain a single PK/FK reference between two tables.
  • ...and 3 more figures