SING-SQL: A Synthetic Data Generation Framework for In-Domain Text-to-SQL Translation

Hasan Alp Caferoğlu, Mehmet Serhat Çelik, Özgür Ulusoy

TL;DR

SING-SQL tackles the need for enterprise-friendly, in-domain Text-to-SQL systems by generating high-coverage synthetic data tailored to a specific database. It introduces a two-stage framework: hierarchical sub-schema construction to create diverse yet tractable contexts, followed by synthetic Text-to-SQL generation with quality validation, executability repair, and reasoning traces, plus a second column-focused pass to balance schema coverage. SingSQL-LM, a family of compact models fine-tuned with LoRA on this synthetic data, demonstrates strong in-domain generalization, achieving state-of-the-art open-model results on the BIRD-based California Schools benchmark and superior performance on synthetic evaluation splits. The findings underscore the value of in-domain synthetic supervision for both training and evaluating enterprise Text-to-SQL systems, and they highlight schema-grounding and schema-only inference as robust strategies for practical deployment. Overall, SING-SQL provides a scalable, database-agnostic path for building and evaluating enterprise-grade Text-to-SQL pipelines with limited annotated data and computational resources.

Abstract

Translating natural language questions into SQL has become a core challenge in enabling non-technical users to query databases. While recent work has explored large-scale synthetic data generation to improve model performance through post-training, most efforts emphasize cross-domain generalization. This leaves a gap for real-world enterprise scenarios, where models need to specialize to a single database schema and organizations need to evaluate their Text-to-SQL systems on their own databases. To address this, we introduce SING-SQL, a fully automated two-stage framework for generating high-quality, high-coverage synthetic Text-to-SQL data for any target database, without relying on SQL logs or manual annotations. Our approach hierarchically partitions a database schema into sub-schemas, synthesizes SQL queries across multiple complexity levels, and applies a quality-aware pipeline that includes LLM-as-a-judge validation, executability checks, automatic repair, and column balancing. We further release SingSQL-LM, a family of compact language models fine-tuned on the synthetic data, achieving strong in-domain generalization. On the California Schools subset of the BIRD benchmark, SingSQL-LM-3B-R64 reaches 82.87% Soft F1 and a 73.03% EX upper bound with 32 candidates, outperforming the best 3B-scale baseline by +16.21 in Soft F1 and +12.36 in EX. At the 1.5B scale, SingSQL-LM-1.5B-R64 improves over prior systems by +9.30 in Soft F1 and +4.49 in EX. On synthetic evaluation sets, SingSQL-LMs exceed prior systems by wide margins, establishing state-of-the-art performance among open models at comparable scales. Our study of context management strategies reveals that schema-free fine-tuning combined with schema-only inference provides the most robust results. These findings establish SING-SQL as a scalable, database-agnostic paradigm for producing and evaluating enterprise-grade Text-to-SQL systems.
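
The executability check mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a SQLite target database, and the `schools` table, the helper names `is_executable` and `build_demo_db`, and the candidate queries are all hypothetical. It only shows the general idea of running each synthesized query and keeping those that execute, while recording the error message so a later repair step could use it.

```python
import sqlite3


def is_executable(conn, sql):
    """Run a candidate query; return (ok, error_message)."""
    try:
        conn.execute(sql).fetchall()
        return True, None
    except sqlite3.Error as exc:
        # The message could be passed to a repair step (not shown here).
        return False, str(exc)


def build_demo_db():
    """Hypothetical in-memory database standing in for the target schema."""
    conn = sqlite3.connect(":memory:")
    conn.executescript(
        """
        CREATE TABLE schools (id INTEGER PRIMARY KEY, name TEXT, county TEXT);
        INSERT INTO schools VALUES
            (1, 'Alpha High', 'Alameda'),
            (2, 'Beta Middle', 'Fresno');
        """
    )
    return conn


conn = build_demo_db()

# Hypothetical synthesized candidates; the second has a misspelled column.
candidates = [
    "SELECT name FROM schools WHERE county = 'Alameda'",
    "SELECT nme FROM schools",
]

results = [(sql, *is_executable(conn, sql)) for sql in candidates]
kept = [sql for sql, ok, _ in results if ok]
```

Here `kept` holds only the queries that ran without error. A fuller pipeline would presumably send the failing ones, together with their error messages, to the repair stage before discarding them.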

Paper Structure

This paper contains 35 sections, 9 figures, 9 tables, and 5 algorithms.

Figures (9)

  • Figure 1: Overview of the Synthetic Data Generation Framework
  • Figure 2: Join Count Comparison of BIRD and Synthetic Data
  • Figure 3: Aggregation Comparison of BIRD and Synthetic Data
  • Figure 4: Overview of the SING-SQL Schema Filtering
  • Figure 5: Example of flawed SQL-to-Text translation where the model misinterprets column semantics.
  • ...and 4 more figures