Enhancing Text-to-SQL Capabilities of Large Language Models via Domain Database Knowledge Injection

Xingyu Ma, Xin Tian, Lingxiang Wu, Xuepeng Wang, Xueming Tang, Jinqiao Wang

TL;DR

This paper addresses hallucination and domain-knowledge gaps in Text-to-SQL by introducing a domain database knowledge injection approach. It trains LLMs with three objectives that encode semantic and schema information from databases and cell values, enhancing column/table name semantics and their co-occurrence. Evaluations on the Spider dataset across multiple open-source LLMs show consistent gains in Execution Match and Exact Match, with notable improvements in column/table name generation and value-column alignment. The method demonstrates robustness to synonyms and unseen databases, though privacy and training-cost considerations remain, suggesting directions for privacy-preserving techniques and downstream integration to reduce overhead.

Abstract

Text-to-SQL is a subtask in semantic parsing that has seen rapid progress with the evolution of Large Language Models (LLMs). However, LLMs face challenges due to hallucination issues and a lack of domain-specific database knowledge (such as table schemas and cell values). As a result, they can make errors in generating table and column names and in matching values to the correct columns in SQL statements. This paper introduces a method of knowledge injection to enhance LLMs' ability to understand schema contents by incorporating prior knowledge. This approach improves their performance in Text-to-SQL tasks. Experimental results show that pre-training LLMs on domain-specific database knowledge and fine-tuning them on downstream Text-to-SQL tasks significantly improves the Execution Match (EX) and Exact Match (EM) metrics across various models. This effectively reduces errors in generating column names and matching values to the columns. Furthermore, the knowledge-injected models can be applied to many downstream Text-to-SQL tasks, demonstrating the generalizability of the approach presented in this paper.

Paper Structure (11 sections, 1 equation, 8 tables)