From Natural Language to SQL: Review of LLM-based Text-to-SQL Systems
Ali Mohammadjafari, Anthony S. Maida, Raju Gottumukkala
TL;DR
The paper surveys LLM-based Text-to-SQL systems, focusing on how Retrieval-Augmented Generation (RAG) and Graph RAG address NL-to-SQL challenges such as schema understanding, ambiguity, and cross-domain generalization. It traces the evolution from rule-based to LLM-based architectures, reviews benchmarks and evaluation metrics, and offers a taxonomy of methods including in-context learning, fine-tuning, and RAG. The authors highlight Graph RAG as a promising direction for grounding queries in structured knowledge graphs, improving accuracy and scalability. The paper also discusses remaining limitations and open challenges, including computational efficiency, dynamic schemas, contextual disambiguation, ethics and privacy, and the role of human-in-the-loop approaches, and outlines directions for future research.
Abstract
LLMs, when used with Retrieval-Augmented Generation (RAG), are greatly improving the state of the art in translating natural language queries into structured and correct SQL. Unlike previous reviews, this survey provides a comprehensive study of the evolution of LLM-based text-to-SQL systems, from early rule-based models to advanced LLM approaches that use RAG systems. We discuss benchmarks, evaluation methods, and evaluation metrics. We also uniquely study the use of Graph RAG for better contextual accuracy and schema linking in these systems. Finally, we highlight key challenges, such as computational efficiency, model robustness, and data privacy, that must be addressed to improve LLM-based text-to-SQL systems.
