Synthesizing Text-to-SQL Data from Weak and Strong LLMs
Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, Chang Zhou
TL;DR
This work addresses the gap between open-source and closed-source LLMs in text-to-SQL by introducing SENSE, a data synthesis framework that combines strong-model data, used for domain diversity, with weak-model error data, which is checked by an executor and used for preference learning. SENSE fine-tunes open-source CodeLLaMA base models with supervised learning on the synthetic strong data, then refines them with direct preference optimization on the weak data. It achieves state-of-the-art results on Spider and competitive performance on the challenging BIRD benchmark, and it remains robust on SYN, REALISTIC, and DK. The results indicate that carefully crafted synthetic data can substantially narrow the performance gap and let open-source LLMs operate effectively on real-world, cross-domain SQL tasks; the data and models are publicly released to spur further progress. Overall, the paper contributes a practical open-source pathway to high-quality text-to-SQL systems and provides detailed ablations and analyses of strong versus weak data, transferability, and robustness across tasks and domains.
Abstract
The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. In this paper, we introduce a synthetic data approach that combines data produced by larger, more powerful models (strong models) with error data generated by smaller, less well-aligned models (weak models). The method not only enhances the domain generalization of text-to-SQL models but also explores the potential of supervision from error data through preference learning. Furthermore, we employ the synthetic data approach for instruction tuning on open-source LLMs, resulting in SENSE, a specialized text-to-SQL model. The effectiveness of SENSE is demonstrated through state-of-the-art results on the SPIDER and BIRD benchmarks, bridging the performance gap between open-source models and methods prompted by closed-source models.
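To make the weak-data step concrete, here is a minimal sketch of how an executor could turn weak-model SQL samples into preference pairs. It is not the authors' implementation: the helper names (`demo_connection`, `execute`, `build_preference_pairs`), the toy `singer` table, the use of SQLite as the executor, and the `prompt`/`chosen`/`rejected` field names are all assumptions made for illustration. The pairing rule (first correct sample versus each incorrect one) is also an assumption.

```python
import sqlite3


def demo_connection():
    """Build a tiny in-memory database (hypothetical schema, for illustration only)."""
    conn = sqlite3.connect(":memory:")
    conn.execute("CREATE TABLE singer (name TEXT, age INTEGER)")
    conn.executemany("INSERT INTO singer VALUES (?, ?)", [("Ann", 25), ("Bob", 30)])
    return conn


def execute(conn, sql):
    """Run a query and return its rows in a canonical order, or None if it fails."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return None
    return sorted(rows, key=repr)


def build_preference_pairs(conn, question, gold_sql, weak_samples):
    """Label weak-model samples by execution result and pair correct with incorrect ones."""
    gold_rows = execute(conn, gold_sql)
    if gold_rows is None:
        return []  # the reference query itself is unusable, so skip this example
    correct, wrong = [], []
    for sql in weak_samples:
        # A sample counts as correct only if it runs and returns the same rows as the gold query.
        (correct if execute(conn, sql) == gold_rows else wrong).append(sql)
    if not correct:
        return []  # assumption: no pair is formed without a correct weak sample
    return [{"prompt": question, "chosen": correct[0], "rejected": bad} for bad in wrong]
```

Under these assumptions, a sample that fails to execute and a sample that runs but returns the wrong rows are both treated as rejected outputs, so the resulting pairs carry the error signal that preference learning is meant to exploit.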
