Valid Text-to-SQL Generation with Unification-based DeepStochLog

Ying Jiao, Luc De Raedt, Giuseppe Marra

TL;DR

This work addresses the risk of invalid SQL queries produced by language models in text-to-SQL by introducing a neurosymbolic framework that imposes strict SQL syntax and schema constraints through unification-based DCGs within DeepStochLog. It introduces LMDCGs, which integrate language models with the grammar to leverage natural language understanding while ensuring validity, enabling end-to-end differentiable inference and learning. On a subset of SQL grammars, the approach yields guaranteed validity, improved ground-truth alignment, and higher execution accuracy than several strong baselines, demonstrating the value of combining neural and symbolic components for schema-aware code generation. The method offers a practical pathway for reliable NL-to-SQL interfaces in real-world systems and provides a foundation for scaling neurosymbolic text-to-SQL with larger LMs and more expressive grammars.

Abstract

Large language models have been used to translate natural language questions to SQL queries. Without hard constraints on syntax and database schema, they occasionally produce invalid queries that are not executable. These failures limit the usage of these systems in real-life scenarios. We propose a neurosymbolic framework that imposes SQL syntax and schema constraints with unification-based definite clause grammars and thus guarantees the generation of valid queries. Our framework also builds a bi-directional interface to language models to leverage their natural language understanding abilities. The evaluation results on a subset of SQL grammars show that all our output queries are valid. This work is the first step towards extending language models with unification-based grammars. We demonstrate this extension enhances the validity, execution accuracy, and ground truth alignment of the underlying language model by a large margin. Our code is available at https://github.com/ML-KULeuven/deepstochlog-lm.

Paper Structure

This paper contains 12 sections, 5 equations, 2 figures, 1 table.

Figures (2)

  • Figure 1: Illustration of a text-to-SQL instance solved by our framework (basic grammar is used for brevity). Given the inputs, the system maximizes the probability of the ground truth SQL query when it is known and produces the most probable query when the target is unknown. The first LMDCG rule $nn_{lm}$ in the logic program prompts the language model $table\_lm$ and gets a probability distribution over the three tables in the dog_kennels database. Similarly, the second one prompts $column\_lm$ and gets a probability distribution over the columns in a given table. The inference steps are shown in Fig. 2.
  • Figure 2: Inference steps on the text-to-SQL instance in Fig. 1. (a) The SLD tree for $derives$($query$("Find the ids of professionals who have ever treated dogs.", "dog_kennels", ["SELECT", "prof_id", "FROM", "Treatments"])). Thanks to unification, the branches of the wrong table and column substitutions will fail. Failing branches are in grey. (b) The corresponding AND-OR circuit. The probabilities of failing branches are not considered.
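
The mechanism described in the two captions can be illustrated with a small Python sketch. It is not the DeepStochLog API and performs no SLD resolution; it only enumerates the derivations of the basic grammar `SELECT <column> FROM <table>`. The functions `table_lm` and `column_lm` are hypothetical stand-ins that return made-up probabilities instead of prompting a language model, and every schema entry other than `Treatments` and `prof_id` is an assumption added for illustration.

```python
# Toy schema: only "Treatments" and "prof_id" come from the figure captions;
# the other table and column names are hypothetical.
SCHEMA = {
    "Dogs": ["dog_id", "name"],
    "Professionals": ["prof_id", "last_name"],
    "Treatments": ["treatment_id", "dog_id", "prof_id"],
}


def table_lm(question):
    """Stand-in for the table language model: a made-up distribution over tables."""
    return {"Dogs": 0.2, "Professionals": 0.3, "Treatments": 0.5}


def column_lm(question, table):
    """Stand-in for the column language model: uniform over the table's own columns."""
    columns = SCHEMA[table]
    return {column: 1.0 / len(columns) for column in columns}


def derivations(question):
    """Enumerate every query the grammar can derive, paired with its probability."""
    for table, p_table in table_lm(question).items():
        # Columns are drawn only from the chosen table, so every derived
        # query respects the schema by construction.
        for column, p_column in column_lm(question, table).items():
            # AND node: a derivation multiplies the choices made along it.
            yield ["SELECT", column, "FROM", table], p_table * p_column


def query_probability(question, target):
    """OR node: sum over the derivations whose token list matches the target.

    Derivations that do not match (the failing branches) contribute nothing.
    """
    return sum(p for tokens, p in derivations(question) if tokens == target)


def most_probable_query(question):
    """Return the token list of the highest-probability derivation."""
    return max(derivations(question), key=lambda pair: pair[1])[0]
```

With a known target, training would push `query_probability(question, target)` upward through the two models; with an unknown target, `most_probable_query` plays the role of inference. A query that pairs a column with a table that does not contain it has no derivation, so its probability in this sketch is zero.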