Text2SQL-Flow: A Robust SQL-Aware Data Augmentation Framework for Text-to-SQL

Qifeng Cai; Hao Liang; Chang Xu; Tao Xie; Wentao Zhang; Bin Cui

TL;DR

This work tackles data scarcity and low diversity in Text-to-SQL by introducing Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid NL-SQL pairs from minimal seed data across six augmentation dimensions. The framework integrates an end-to-end pipeline with SQL execution verification, natural language question generation, chain-of-thought (CoT) reasoning traces, and data classification, and a modular Database Manager provides cross-database compatibility and scalability. Using it, the authors build SQLFlow, a dataset of 89,544 annotated examples. They show that fine-tuning open-source LLMs on SQLFlow consistently improves performance on standard benchmarks under the same data budget. For closed-source models, they propose a masked alignment retrieval method that uses SQLFlow as both the knowledge base and the retriever's training data for structure-aware few-shot example selection, and they report that it outperforms existing retrieval methods. Overall, the work demonstrates the value of high-quality, structured data in Text-to-SQL and provides a scalable, data-centric foundation for advancing Text-to-SQL systems.

Abstract

The data-centric paradigm has become pivotal in AI, especially for Text-to-SQL, where performance is limited by scarce, simplistic, and low-diversity datasets. To address this, we propose Text2SQL-Flow, a SQL-aware data augmentation framework that generates large-scale, semantically valid, and structurally diverse Text-to-SQL pairs from minimal seed data. It operates across six augmentation dimensions and integrates an end-to-end pipeline featuring SQL execution verification, natural language question generation, chain-of-thought reasoning traces, and data classification. A modular Database Manager ensures cross-database compatibility and scalability. Using this framework, we build SQLFlow, a high-quality dataset of 89,544 annotated examples. We evaluate SQLFlow in two settings: (1) For open-source LLMs, fine-tuning on SQLFlow consistently improves performance across benchmarks under the same data budget. (2) For closed-source LLMs, we introduce a masked alignment retrieval method that treats SQLFlow as both knowledge base and training data for the retriever. This enables structure-aware example matching by modeling fine-grained alignments between questions and SQL queries. Experiments show our retrieval strategy outperforms existing methods, underscoring the value of SQLFlow's high-fidelity data and our novel technique. Our work establishes a scalable, data-centric foundation for advancing Text-to-SQL systems and highlights the critical role of high-quality structured data in modern AI.
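
The SQL execution verification step mentioned in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the helper name `is_executable`, the toy `singer` schema, and the use of an in-memory SQLite database are all assumptions made for illustration. The idea is only that a generated query is kept if it runs without error against the target schema.

```python
import sqlite3


def is_executable(sql: str, schema_ddl: str) -> bool:
    """Return True if `sql` runs without error on a fresh in-memory SQLite
    database built from `schema_ddl`. Hypothetical helper, not from the paper."""
    conn = sqlite3.connect(":memory:")
    try:
        conn.executescript(schema_ddl)   # create tables and load toy rows
        conn.execute(sql).fetchall()     # run the candidate query
        return True
    except sqlite3.Error:
        return False                     # syntax or schema errors reject the query
    finally:
        conn.close()


# Toy schema (an assumption; any seed database would take its place).
SCHEMA = """
CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
INSERT INTO singer VALUES (1, 'Ann', 30), (2, 'Bo', 25);
"""

# Candidate augmented queries; the second references a misspelled column.
candidates = [
    "SELECT name FROM singer WHERE age > 26",
    "SELECT nme FROM singer",
    "SELECT COUNT(*) FROM singer GROUP BY age",
]

# Keep only the queries that execute successfully.
verified = [q for q in candidates if is_executable(q, SCHEMA)]
```

A filter like this only checks that a query executes; whether it matches the intended question is a separate concern that the pipeline's other stages would have to address.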
