SQL-GEN: Bridging the Dialect Gap for Text-to-SQL Via Synthetic Data And Model Merging

Mohammadreza Pourreza, Ruoxi Sun, Hailong Li, Lesly Miculicich, Tomas Pfister, Sercan O. Arik

TL;DR

SQL-GEN is a framework for generating high-quality synthetic training data for any SQL dialect, guided by readily available dialect-specific tutorials. The paper also proposes a novel Mixture-of-Experts (MoE) initialization that leverages the shared knowledge across dialects.

Abstract

Recent advances in Text-to-SQL have largely focused on the SQLite dialect, neglecting the diverse landscape of SQL dialects like BigQuery and PostgreSQL. This limitation is due to the diversity in SQL syntaxes and functions, along with the high cost of collecting and curating SQL-specific training data. To address this, we introduce SQL-GEN, a framework for generating high-quality synthetic training data for any SQL dialect, guided by readily available dialect-specific tutorials. SQL-GEN significantly improves cross-dialect Text-to-SQL performance, boosting execution accuracy by up to 20% over existing methods. This performance gain narrows the gap with models trained on large-scale human-annotated data. Furthermore, combining synthetic data from SQL-GEN with human-annotated data yields additional improvements of up to 5.6%. To unify multi-dialect capabilities within a single model, we propose a novel Mixture-of-Experts (MoE) initialization that leverages the shared knowledge across dialects. Our approach merges self-attention layers from dialect-specific models and initializes expert gates using dialect-specific keywords. This leads to a versatile model optimized for multiple SQL dialects, outperforming single-dialect models and significantly enhancing overall performance.
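
The MoE initialization described in the abstract can be illustrated with a toy sketch. The abstract does not say which operator is used to merge the self-attention layers or how the keyword representations are obtained, so the sketch below assumes plain element-wise averaging for the merge and the mean of made-up keyword embeddings for each gate row. The function names (`merge_attention`, `init_gate`, `route`) and all numbers are hypothetical and are not taken from the paper.

```python
from statistics import fmean

def merge_attention(expert_weights):
    """Element-wise average of same-shaped attention matrices, one per dialect expert.
    Averaging is an assumed stand-in; the abstract does not name the merge operator."""
    rows, cols = len(expert_weights[0]), len(expert_weights[0][0])
    return [
        [fmean(w[r][c] for w in expert_weights) for c in range(cols)]
        for r in range(rows)
    ]

def init_gate(keyword_embeddings):
    """Build one gate row per expert as the mean embedding of that dialect's keywords.
    (Assumption: the gate is a linear router with one row per expert.)"""
    gate = {}
    for dialect, vectors in keyword_embeddings.items():
        dim = len(vectors[0])
        gate[dialect] = [fmean(v[i] for v in vectors) for i in range(dim)]
    return gate

def route(gate, hidden):
    """Return the expert whose gate row has the largest dot product with `hidden`."""
    scores = {d: sum(g * h for g, h in zip(row, hidden)) for d, row in gate.items()}
    return max(scores, key=scores.get)

# Toy 2x2 "attention" matrices from two hypothetical dialect-specific models.
attention = {
    "sqlite": [[1.0, 0.0], [0.0, 1.0]],
    "postgres": [[3.0, 2.0], [2.0, 3.0]],
}
merged = merge_attention(list(attention.values()))

# Toy 2-d embeddings standing in for dialect-specific keyword representations.
keywords = {
    "sqlite": [[1.0, 0.0], [0.8, 0.2]],
    "postgres": [[0.0, 1.0], [0.2, 0.8]],
}
gate = init_gate(keywords)
chosen = route(gate, [0.7, 0.1])  # hidden state that resembles the SQLite keywords
```

In this toy setup the shared attention matrix is the average of the two experts, while the gate starts out biased toward the expert whose keywords resemble the incoming hidden state. A real model would apply this per Transformer block and continue training afterwards.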

Paper Structure (55 sections, 4 equations, 12 figures, 13 tables, 1 algorithm)

Figures (12)

  • Figure 1: Example of a question answered using different SQL keywords in different dialects: BigQuery, PostgreSQL, and SQLite.
  • Figure 2: Overview of SQL-GEN, which generates diverse, high-quality synthetic Text-to-SQL samples for any database.
  • Figure 3: An example of template expansion using BigQuery tutorials and seed templates.
  • Figure 4: Our proposed method for initializing one Transformer block of an MoE model from different dialect experts, exemplified here for the Postgres, SQLite, and BigQuery dialects to create a single all-in-one model. Objects in yellow denote multi-dialect models.
  • Figure 5: Comparison of queries generated by our method and by the baselines, in terms of SQL keyword diversity and the number of dialect-specific queries.
  • ...and 7 more figures