Knowledge Base Construction for Knowledge-Augmented Text-to-SQL

Jinheon Baek, Horst Samulowitz, Oktie Hassanzadeh, Dharmashankar Subramanian, Sola Shirai, Alfio Gliozzo, Debarun Bhattacharjya

TL;DR

This paper introduces KAT-SQL, a knowledge-base–driven framework for text-to-SQL that constructs a comprehensive repository of domain- and schema-relevant knowledge and retrieves it to augment SQL generation. The knowledge base is built from existing samples and database schemas and expanded automatically via LLMs with context-rich prompts and carefully selected few-shot examples, then leveraged through embedding-based retrieval and query-conditioned refinement. Across BIRD, Spider, and CSTINSIGHT datasets and multiple database overlap scenarios, KAT-SQL consistently outperforms knowledge-augmented baselines and approaches oracle-level knowledge, demonstrating strong gains in execution accuracy and efficiency. The work also shows that the knowledge base generalizes to unseen domains, maintains robustness across different LLMs, and remains efficient in real-time usage, underscoring its practical impact for scalable, knowledge-grounded text-to-SQL systems.

Abstract

Text-to-SQL aims to translate natural language queries into SQL statements, which is practical because it lets anyone easily retrieve the desired information from databases. Many recent approaches tackle this problem with Large Language Models (LLMs), leveraging their strong capability in understanding user queries and generating the corresponding SQL code. Yet the parametric knowledge in LLMs may be insufficient to cover all the diverse, domain-specific queries that require grounding in various database schemas, which often makes the generated SQL less accurate. To tackle this, we propose constructing a knowledge base for text-to-SQL, a foundational source of knowledge from which we retrieve and generate the knowledge necessary for given queries. In particular, unlike existing approaches that either manually annotate knowledge or generate only a few pieces of knowledge for each query, our knowledge base is comprehensive: it is constructed from a combination of all the available questions and their associated database schemas along with their relevant knowledge, and it can be reused for unseen databases from different datasets and domains. We validate our approach on multiple text-to-SQL datasets, considering both overlapping and non-overlapping database scenarios, where it substantially outperforms relevant baselines.
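The retrieve-then-generate flow described above can be illustrated with a minimal sketch. This is not the authors' implementation: the knowledge entries, the schema string, and the helper names (`embed`, `cosine`, `retrieve`, `build_prompt`) are all hypothetical, and a toy bag-of-words vector stands in for whatever learned embedding model a real system would use. It only shows the general shape of embedding-based retrieval followed by prompt augmentation.

```python
import math
import re
from collections import Counter

# Hypothetical knowledge entries, invented for illustration (not from the paper).
KNOWLEDGE_BASE = [
    "eligible free rate = free meal count / enrollment",
    "charter schools refers to charter school = 1",
    "highest average score refers to max(avg score)",
]


def embed(text):
    """Toy stand-in for a learned embedding model: bag-of-words term counts."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a, b):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(count * b[term] for term, count in a.items() if term in b)
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query, entries, k=2):
    """Return the k entries most similar to the query."""
    q = embed(query)
    return sorted(entries, key=lambda e: cosine(q, embed(e)), reverse=True)[:k]


def build_prompt(query, schema, knowledge):
    """Assemble a knowledge-augmented prompt for a SQL-generating LLM (assumed format)."""
    hints = "\n".join(f"- {item}" for item in knowledge)
    return (
        f"Schema:\n{schema}\n\n"
        f"Relevant knowledge:\n{hints}\n\n"
        f"Question: {query}\nSQL:"
    )


if __name__ == "__main__":
    question = "What is the eligible free rate for schools?"
    schema = "schools(id, enrollment, free_meal_count, charter_school)"  # hypothetical
    print(build_prompt(question, schema, retrieve(question, KNOWLEDGE_BASE)))
```

In this sketch the retrieved entries are pasted into the prompt as-is. The framework summarized above additionally refines the retrieved knowledge for the given query before SQL generation, a step omitted here.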

Paper Structure

This paper contains 34 sections, 3 figures, 10 tables, 1 algorithm.

Figures (3)

  • Figure 1: (A) Text-to-SQL aims to translate a user query into a SQL statement executable over a database, to access the desired information. (B) Existing Text-to-SQL with Knowledge Generation approaches first generate the knowledge relevant to the user query and then formulate the SQL statement with this generated knowledge. (C) Our Text-to-SQL with Knowledge Base Construction approach builds a repository of knowledge and then reuses the knowledge within it across multiple queries and databases. (Right) We observe that the knowledge in the training set of the text-to-SQL benchmark dataset BIRD covers 21% of the knowledge required for test-time queries, and our constructed knowledge base further covers 50% of them.
  • Figure 2: Knowledge-Augmented Text-to-SQL
  • Figure 3: Results for coverage and relevance of knowledge entries in the constructed knowledge base against gold knowledge, with different numbers of knowledge generation steps.