ExeSQL: Self-Taught Text-to-SQL Models with Execution-Driven Bootstrapping for SQL Dialects

Jipeng Zhang, Haolin Yang, Kehao Miao, Ruiyuan Zhang, Renjie Pi, Jiahui Gao, Xiaofang Zhou

TL;DR

ExeSQL tackles the cross-dialect text-to-SQL gap by integrating translation bootstrapping, execution-driven rejection sampling, and offline preference optimization. It grounds SQL generation in executable semantics by validating dialect-specific queries against real databases and iteratively refining both data and model behavior. The method is formalized with a learning objective that rewards execution success and leverages Direct Preference Optimization (DPO) to align outputs with executable SQL, yielding strong gains across PostgreSQL, MySQL, and Oracle. Empirical results demonstrate solid improvements over strong baselines and robust generalization across in-distribution (ID) and out-of-distribution (OOD) benchmarks, suggesting practical impact for real-world multi-dialect SQL generation.

Abstract

Recent text-to-SQL models have achieved strong performance, but their effectiveness remains largely confined to SQLite due to dataset limitations. Real-world applications, however, require SQL generation across multiple dialects with varying syntax and specialized features, which remains a challenge for current models. The main obstacle in building a dialect-aware model lies in acquiring high-quality dialect-specific data. Data generated purely through static prompting, without validating the SQL via execution, tends to be noisy and unreliable. Moreover, the lack of real execution environments in the training loop prevents models from grounding their predictions in executable semantics, limiting generalization despite surface-level improvements from data filtering. This work introduces ExeSQL, a text-to-SQL framework with execution-driven, agentic bootstrapping. The method consists of iterative query generation, execution-based filtering (e.g., rejection sampling), and preference-based training, enabling the model to adapt to new SQL dialects through verifiable, feedback-guided learning. Experiments show that ExeSQL bridges the dialect gap in text-to-SQL, achieving average improvements of 15.2%, 10.38%, and 4.49% over GPT-4o on PostgreSQL, MySQL, and Oracle, respectively, across multiple datasets of varying difficulty.
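The execution-based filtering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes an in-memory SQLite database as a stand-in for a real dialect backend, and the helper names (`run_query`, `rejection_sample`, `build_demo_db`), the toy `singer` table, and the result-matching criterion (comparing row multisets against a reference result) are all hypothetical choices made for illustration.

```python
import sqlite3
from collections import Counter


def run_query(conn, sql):
    """Execute sql and return (ok, rows) on success or (False, error text) on failure."""
    try:
        return True, conn.execute(sql).fetchall()
    except sqlite3.Error as exc:
        return False, str(exc)


def rejection_sample(conn, candidates, reference_rows):
    """Keep candidates that execute and return the same rows as the reference.

    Rows are compared as multisets, so row order is ignored (an assumption).
    """
    target = Counter(reference_rows)
    kept = []
    for sql in candidates:
        ok, rows = run_query(conn, sql)
        if ok and Counter(rows) == target:
            kept.append(sql)
    return kept


def build_demo_db():
    """Create a tiny hypothetical database used only for this example."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (id INTEGER, name TEXT, age INTEGER)")
    conn.executemany(
        "INSERT INTO singer VALUES (?, ?, ?)",
        [(1, "Ann", 30), (2, "Bob", 45), (3, "Cy", 52)],
    )
    return conn


conn = build_demo_db()
# Reference result, obtained here by running a known-good query.
_, reference_rows = run_query(conn, "SELECT name FROM singer WHERE age > 40")

candidates = [
    "SELECT name FROM singer WHERE age > 40 ORDER BY name",  # executes, matches
    "SELECT name FROM singer WHERE age >= 30",               # executes, wrong rows
    "SELECT nme FROM singer WHERE age > 40",                 # fails: unknown column
]
kept = rejection_sample(conn, candidates, reference_rows)
print(kept)
```

In a multi-dialect setting, the same loop would run against a PostgreSQL, MySQL, or Oracle connection instead of SQLite, and the error text returned by `run_query` is the kind of signal that could be fed back to the generator for refinement.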

Paper Structure

This paper contains 45 sections, 5 equations, 6 figures, 25 tables.

Figures (6)

  • Figure 1: Given a natural language question, different SQL dialects require distinct syntax adjustments, such as explicit type casting in PostgreSQL. Beyond the traditional text-input–SQL-output formulation, we incorporate the database environment to enable agentic execution feedback for data synthesis and training.
  • Figure 2: Execution-based error feedback loop for dialect-specific SQL refinement. This loop yields a bootstrap dataset that resolves the cold-start issue of training an expert dialect model.
  • Figure 3: Pipeline for Dialect Text-to-SQL Data Generation and Model Training. The framework consists of three stages: (1) Translation Bootstrapping: A bootstrap text-to-SQL model is fine-tuned using SQL translations from an existing dataset (e.g., SQLite) to other dialects (e.g., MySQL, PostgreSQL). (2) Iterative Data Generation and Training: The model generates multiple SQL candidates per question, which are validated via execution feedback. Correct queries are retained to refine the dataset, enabling iterative self-improvement. (3) Preference Enhancement: A Direct Preference Optimization (DPO) step is applied to distinguish correct and incorrect SQL queries. High-quality pairs (question, correct SQL) are used to further improve the model’s performance and preference learning, ensuring both correctness and efficiency in SQL generation.
  • Figure 4: Retention rate of correct dialect SQL under different best-of-N sampling strategies on 1,000 queries. Results show the bootstrapped model already produces many correct samples, with larger N further improving correctness.
  • Figure 5: SQLite to PostgreSQL process
  • ...and 1 more figure
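The preference-enhancement stage (Figure 3) and the best-of-N retention measurement (Figure 4) can be sketched as follows. This is a hedged illustration, not the paper's code: the function names are hypothetical, the `prompt`/`chosen`/`rejected` record layout is a common convention for DPO training data rather than something the source specifies, and "retention" is assumed here to mean that at least one of the first N sampled candidates for a question passed execution validation.

```python
from itertools import product


def build_preference_pairs(question, labeled_candidates):
    """Pair every execution-validated SQL with every failed SQL for one question.

    labeled_candidates is a list of (sql, passed_execution) tuples.
    """
    chosen = [sql for sql, ok in labeled_candidates if ok]
    rejected = [sql for sql, ok in labeled_candidates if not ok]
    return [
        {"prompt": question, "chosen": c, "rejected": r}
        for c, r in product(chosen, rejected)
    ]


def retention_rate(per_question_labels, n):
    """Fraction of questions with at least one correct candidate among the first n."""
    hits = sum(any(labels[:n]) for labels in per_question_labels)
    return hits / len(per_question_labels)


# Hypothetical candidates for one question, already labeled by execution.
pairs = build_preference_pairs(
    "How many singers are older than 40?",
    [
        ("SELECT COUNT(*) FROM singer WHERE age > 40", True),
        ("SELECT COUNT(*) FROM singer WHERE age > '40", False),
        ("SELECT COUNT(name) FROM singers WHERE age > 40", False),
    ],
)

# Hypothetical execution outcomes for four questions, four samples each.
labels = [
    [False, True, False, False],
    [False, False, False, True],
    [False, False, False, False],
    [True, True, False, False],
]
rates = {n: retention_rate(labels, n) for n in (1, 2, 4)}
print(len(pairs), rates)
```

Under this definition the retention rate can only stay flat or rise as N grows, which is consistent with the trend the Figure 4 caption describes.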