Knowledge-to-SQL: Enhancing SQL Generation with Data Expert LLM

Zijin Hong, Zheng Yuan, Hao Chen, Qinggang Zhang, Feiran Huang, Xiao Huang

TL;DR

This work tackles the knowledge gap in text-to-SQL by introducing Knowledge-to-SQL, which leverages a Data Expert LLM (DELLM) to generate expert knowledge from a question-schema pair. DELLM combines a table-reading module and supervised fine-tuning (SFT) with a downstream Preference Learning via Database Feedback (PLDBF) to refine knowledge based on database execution and ground-truth SQL contributions. A direct preference optimization (DPO) framework is used to derive a PL-refined DELLM, which then augments downstream LLMs in generating accurate SQL. Experiments on the BIRD and Spider benchmarks show that DELLM consistently improves execution accuracy (EX) and valid efficiency score (VES) across baselines, with ablations highlighting the importance of database feedback and table reading; the approach is released as open-source for further research and practical adoption.
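
For reference, the DPO step mentioned above is usually written with the standard objective below. This is the generic formulation, not necessarily the paper's own notation. The mapping onto DELLM is an assumption drawn from this summary: $x$ is the question-schema input, $y_w$ and $y_l$ are the preferred and dispreferred pieces of generated knowledge, $\pi_{\mathrm{ref}}$ is the SFT DELLM, and $\beta$ is a scaling hyperparameter.

```latex
\mathcal{L}_{\mathrm{DPO}}(\pi_\theta; \pi_{\mathrm{ref}})
  = -\,\mathbb{E}_{(x,\, y_w,\, y_l) \sim \mathcal{D}}
    \left[ \log \sigma\!\left(
      \beta \log \frac{\pi_\theta(y_w \mid x)}{\pi_{\mathrm{ref}}(y_w \mid x)}
      - \beta \log \frac{\pi_\theta(y_l \mid x)}{\pi_{\mathrm{ref}}(y_l \mid x)}
    \right) \right]
```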

Abstract

Generating accurate SQL queries for user questions (text-to-SQL) has been a long-standing challenge, since it requires a deep understanding of both the user's question and the corresponding database schema in order to retrieve the desired content accurately. Existing methods rely on the comprehensive capability of large language models (LLMs) to generate the SQL. However, some necessary knowledge is neither explicitly included in the database schema and user question nor learned by LLMs. Thus, the SQL generated for such knowledge-insufficient questions may be inaccurate, negatively affecting the performance and robustness of text-to-SQL models. To address this challenge, we propose the Knowledge-to-SQL framework, which employs a tailored Data Expert LLM (DELLM) to provide helpful knowledge for all text-to-SQL models. Specifically, we introduce the detailed implementation of DELLM regarding table reading and the basic fine-tuning process. We further propose a Preference Learning via Database Feedback (PLDBF) strategy, refining the DELLM to generate more helpful knowledge for LLMs. Extensive experiments verify that DELLM can enhance state-of-the-art approaches for text-to-SQL tasks. The corresponding code of DELLM is released for further research.
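
As a rough illustration of how database feedback could turn generated knowledge into preference pairs, the sketch below executes the SQL produced under each candidate piece of knowledge and compares its result with that of the ground-truth SQL. It is a minimal sketch under stated assumptions, not the authors' implementation: the helper names (`run_query`, `build_preference_pairs`), the toy `schools` table, and the rule "matching execution result means helpful" are all hypothetical, and the predicted SQL strings are supplied by hand rather than by an LLM.

```python
import sqlite3
from itertools import product


def run_query(conn, sql):
    """Return the result rows as a set, or None if the SQL fails to execute."""
    try:
        return set(conn.execute(sql).fetchall())
    except sqlite3.Error:
        return None


def build_preference_pairs(conn, gold_sql, candidates):
    """Split candidate knowledge into helpful/unhelpful by execution feedback.

    candidates: list of (knowledge, predicted_sql) tuples, where predicted_sql
    is the SQL a downstream model produced when given that knowledge.
    Returns (chosen_knowledge, rejected_knowledge) pairs.
    """
    gold_rows = run_query(conn, gold_sql)
    chosen, rejected = [], []
    for knowledge, predicted_sql in candidates:
        rows = run_query(conn, predicted_sql)
        # Assumed rule: knowledge is helpful if the SQL runs and matches gold.
        if rows is not None and rows == gold_rows:
            chosen.append(knowledge)
        else:
            rejected.append(knowledge)
    return list(product(chosen, rejected))


# Toy database (hypothetical schema, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE schools (name TEXT, enrolled INTEGER, free_meals INTEGER);
INSERT INTO schools VALUES ('A', 100, 40), ('B', 200, 20);
""")

gold_sql = "SELECT name FROM schools WHERE 1.0 * free_meals / enrolled > 0.3"
candidates = [
    ("eligible free rate = free_meals / enrolled",
     "SELECT name FROM schools WHERE 1.0 * free_meals / enrolled > 0.3"),
    ("eligible free rate = free_meals",
     "SELECT name FROM schools WHERE free_meals > 0.3"),
    ("eligible free rate is stored in column rate",
     "SELECT name FROM schools WHERE rate > 0.3"),  # fails: no such column
]

pairs = build_preference_pairs(conn, gold_sql, candidates)
for chosen, rejected in pairs:
    print(f"prefer {chosen!r} over {rejected!r}")
```

Pairs of this shape are what a preference-learning step such as DPO would consume; how the paper additionally weighs the knowledge's contribution to the ground-truth SQL is not captured here.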

Paper Structure (33 sections, 12 equations, 4 figures, 8 tables)

Figures (4)

  • Figure 1: A sketch map illustrating the significance of incorporating expert knowledge in the text-to-SQL implementation. In the given example, the generation without expert knowledge makes mistakes in arithmetic reasoning and data conditions. Expert knowledge bridges the knowledge gap between the LLMs and the database, which assists the LLMs in generating accurate SQL.
  • Figure 2: Overview of our approach. The upper part shows the overall Knowledge-to-SQL framework; the details of DELLM are presented at the bottom. The left side shows the framework of DELLM, including supervised fine-tuning (SFT) and table reading. The right side shows preference learning via database feedback (PLDBF), which is employed to further refine the performance of DELLM.
  • Figure 3: Improvement that DELLM brings to GPT-4 on different metrics when using different ratios of training data.
  • Figure 4: The different effects of DELLM on GPT-4's performance on the BIRD dev set.