Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Qifeng Cai, Hao Liang, Chang Xu, Tao Xie, Wentao Zhang, Bin Cui

TL;DR

This work tackles data scarcity and low diversity in Text-to-SQL by introducing Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid NL-SQL pairs from seed data across six augmentation dimensions. Its end-to-end pipeline covers SQL execution verification, NL question generation, CoT reasoning traces, and data classification, and a modular Database Manager provides cross-database compatibility and scalable synthesis. The resulting dataset, SQLFlow, contains 89,544 annotated examples. Fine-tuning open-source LLMs on SQLFlow consistently improves performance on standard benchmarks under the same data budget. For closed-source models, the authors propose a masked alignment retrieval method that uses SQLFlow as both the knowledge base and the retriever's training data for structure-aware few-shot example selection, and it outperforms existing retrieval methods. Overall, the work demonstrates the value of high-quality structured data and provides a scalable, data-centric foundation for Text-to-SQL systems.

Abstract

The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow's high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
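
As an illustration of the SQL execution verification step named in the abstract, the sketch below keeps only candidate queries that run without error against a schema. It is a minimal sketch, not the authors' implementation: it assumes SQLite as the backend, and the names `is_executable`, `filter_valid`, `SCHEMA`, and `CANDIDATES` are hypothetical.

```python
import sqlite3


def is_executable(sql, schema_ddl):
    """Return True if `sql` runs without error on an in-memory SQLite
    database built from `schema_ddl` (assumed backend, for illustration)."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)  # build the (empty) schema
        conn.execute(sql).fetchall()    # run the candidate query
        return True
    except sqlite3.Error:
        # Syntax errors, unknown tables or columns, and similar failures.
        return False
    finally:
        conn.close()


def filter_valid(candidates, schema_ddl):
    """Keep only the candidate queries that execute successfully."""
    return [sql for sql in candidates if is_executable(sql, schema_ddl)]


# Hypothetical schema and augmented candidates.
SCHEMA = "CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);"
CANDIDATES = [
    "SELECT name FROM singer WHERE age > 30",
    "SELECT nme FROM singer",  # misspelled column, rejected
    "SELECT COUNT(*) FROM singer GROUP BY age",
]

if __name__ == "__main__":
    for sql in filter_valid(CANDIDATES, SCHEMA):
        print(sql)
```

Executing against an empty schema only catches queries that fail to parse or bind. A fuller check would run them on populated databases, which is presumably where the Database Manager comes in.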

Paper Structure

This paper contains 29 sections, 9 equations, 6 figures, 4 tables.

Figures (6)

  • Figure 1: Performance comparison of models trained on SQLFlow vs. SynSQL [li2025omnisql] (both at 89,544 samples). Blue labels indicate results for fine-tuned open-source models, while green labels indicate results for a closed-source model that uses few-shot retrieval for prompt construction.
  • Figure 2: Overall framework of our work.
  • Figure 3: Augmented data utilization for open-source and closed-source LLMs.
  • Figure 4: Comparison of visualization between original SQLs and augmented SQLs using our framework.
  • Figure 5: Performance of question–SQL alignment strategy retrieval models trained on different datasets. No masking operations are applied during the retrieval process. "SynSQL-part*" and "SQLFlow-part*" denote the corresponding datasets combined with Spider-train and BIRD-train.
  • ...and 1 more figure