Retrieval and Augmentation of Domain Knowledge for Text-to-SQL Semantic Parsing

Manasi Patwardhan, Ayush Agarwal, Shabbirhussain Bhaisaheb, Aseem Arora, Lovekesh Vig, Sunita Sarawagi

TL;DR

The paper tackles the challenge of translating natural language (NL) queries to SQL across diverse and often opaque enterprise database (DB) schemas. It introduces a DB-level domain knowledge framework that solicits, structures, and indexes domain statements (DSs), then retrieves the statements most relevant to a user query via sub-string matching to augment LLM-based NL-to-SQL parsing. Key contributions include a structured DS format linking NL expressions to SQL snippets, a DBA-validated structuring process, and an efficient retrieval method that significantly improves execution accuracy over baselines across multiple LLMs and schemas. The work demonstrates practical gains for enterprise databases and lays groundwork for real-time DS updates and automated disambiguation of domain expressions in the NL-to-SQL pipeline.

Abstract

The performance of Large Language Models (LLMs) for translating Natural Language (NL) queries into SQL varies significantly across databases (DBs). NL queries are often expressed using a domain-specific vocabulary, and mapping these to the correct SQL requires an understanding of the embedded domain expressions and their relationship to the DB schema structure. Existing benchmarks rely on unrealistic, ad-hoc, query-specific textual hints for expressing domain knowledge. In this paper, we propose a systematic framework for associating structured domain statements at the database level. We present retrieval of relevant structured domain statements given a user query using sub-string-level matching. We evaluate on eleven realistic DB schemas covering diverse domains across five open-source and proprietary LLMs and demonstrate that (1) DB-level structured domain statements are more practical and accurate than existing ad-hoc, query-specific textual domain statements, and (2) our sub-string-match-based retrieval of relevant domain statements provides significantly higher accuracy than other retrieval approaches.
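
To make the idea of sub-string-level retrieval concrete, here is a minimal Python sketch. It is not the paper's ranking mechanism, whose scoring details are not given in this summary. As a stand-in, it scores each domain statement by the longest contiguous run of tokens its NL expression shares with the user query. The `DomainStatement` class, the `longest_shared_span` and `retrieve` functions, and the sample statements are all hypothetical names and data chosen for illustration.

```python
from __future__ import annotations

from dataclasses import dataclass

_PUNCT = ".,?!;:()\"'"


@dataclass(frozen=True)
class DomainStatement:
    """Hypothetical structured DS: an NL expression paired with a SQL snippet."""
    nl_expression: str
    sql_snippet: str


def _tokens(text: str) -> list[str]:
    # Lowercase, split on whitespace, and trim punctuation from token edges.
    stripped = (t.strip(_PUNCT) for t in text.lower().split())
    return [t for t in stripped if t]


def longest_shared_span(query: str, expression: str) -> int:
    """Length, in tokens, of the longest contiguous token run found in both texts."""
    q, e = _tokens(query), _tokens(expression)
    best = 0
    prev = [0] * (len(e) + 1)
    for i in range(1, len(q) + 1):
        curr = [0] * (len(e) + 1)
        for j in range(1, len(e) + 1):
            if q[i - 1] == e[j - 1]:
                # Extend the run that ended at the previous token pair.
                curr[j] = prev[j - 1] + 1
                best = max(best, curr[j])
        prev = curr
    return best


def retrieve(query: str, statements: list[DomainStatement], top_k: int = 2) -> list[DomainStatement]:
    """Return up to top_k statements with a non-zero overlap, best match first."""
    scored = [(longest_shared_span(query, s.nl_expression), s) for s in statements]
    scored = [pair for pair in scored if pair[0] > 0]
    # sort() is stable, so ties keep their original order.
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [s for _, s in scored[:top_k]]


# Illustrative data only; not taken from the paper's schemas.
statements = [
    DomainStatement("active customer", "customer.status = 'A'"),
    DomainStatement("gross margin", "(revenue - cost) / revenue"),
    DomainStatement("premium customer", "customer.tier = 'P'"),
]
query = "List every active customer with gross margin above 0.4"
top = retrieve(query, statements, top_k=2)
for ds in top:
    print(ds.nl_expression, "->", ds.sql_snippet)
```

In an augmentation pipeline of the kind the abstract describes, the retrieved statements would then be added to the LLM prompt alongside the schema and the query. How the prompt is assembled is outside what this summary specifies.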

Paper Structure

This paper contains 13 sections, 1 equation, 2 figures, 11 tables.

Figures (2)

  • Figure 1: Framework for Enterprise NL-SQL Semantic Parsing. Mentions of the same entities across the DB Schema, NL and Structured Domain Statements, NL Query, and Ground Truth SQL are coded with the same color.
  • Figure 2: Illustrative Example for Sub-String based Ranking (SbR) Mechanism