Synthesizing Text-to-SQL Data from Weak and Strong LLMs

Jiaxi Yang, Binyuan Hui, Min Yang, Jian Yang, Junyang Lin, Chang Zhou

TL;DR

This work addresses the gap between open-source and closed-source LLMs in text-to-SQL by introducing Sense, a data synthesis framework that combines strong-model data for domain diversity with weak-model error data guided by an executor and preference learning. Sense fine-tunes open-source CodeLLaMA bases via supervised learning on synthetic strong data and refines outputs through direct preference optimization on weak data, achieving state-of-the-art results on Spider and competitive performance on the challenging BIRD benchmark, as well as robustness across SYN, REALISTIC, and DK. The approach demonstrates that carefully crafted synthetic data can substantially narrow the performance gap and enable open-source LLMs to operate effectively in real-world, cross-domain SQL tasks, with publicly released data and models to spur further progress. Overall, the paper contributes a practical open-source pathway for high-quality text-to-SQL systems and provides detailed ablations and analyses of strong versus weak data, transferability, and robustness across tasks and domains.

Abstract

The capability gap between open-source and closed-source large language models (LLMs) remains a challenge in text-to-SQL tasks. In this paper, we introduce a synthetic data approach that combines data produced by larger, more powerful models (strong models) with error information data generated by smaller, not well-aligned models (weak models). The method not only enhances the domain generalization of text-to-SQL models but also explores the potential of error data supervision through preference learning. Furthermore, we employ the synthetic data approach for instruction tuning on open-source LLMs, resulting in SENSE, a specialized text-to-SQL model. The effectiveness of SENSE is demonstrated through state-of-the-art results on the SPIDER and BIRD benchmarks, bridging the performance gap between open-source models and methods prompted by closed-source models.
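
To make the idea of executor-guided error data concrete, the following is a minimal sketch of how weak-model outputs could be turned into preference pairs. It is an illustration under stated assumptions, not the authors' implementation: the helper names (`execute`, `build_preference_pairs`), the use of SQLite as the executor, the order-insensitive result comparison, and the choice to pair every correct candidate with every incorrect one are all hypothetical.

```python
import sqlite3
from itertools import product


def execute(conn, sql):
    """Run a query; return (ok, rows). Rows are sorted so the comparison ignores row order."""
    try:
        rows = conn.execute(sql).fetchall()
    except sqlite3.Error:
        return False, None
    return True, sorted(rows, key=repr)


def build_preference_pairs(conn, question, gold_sql, candidates):
    """Label weak-model candidates by execution and pair correct ones with incorrect ones.

    Assumption: a candidate counts as correct when its execution result
    matches the result of the reference (gold) query.
    """
    gold_ok, gold_rows = execute(conn, gold_sql)
    if not gold_ok:
        return []
    chosen, rejected = [], []
    for cand in candidates:
        ok, rows = execute(conn, cand)
        (chosen if ok and rows == gold_rows else rejected).append(cand)
    return [
        {"prompt": question, "chosen": c, "rejected": r}
        for c, r in product(chosen, rejected)
    ]


# Toy database and hypothetical weak-model samples.
conn = sqlite3.connect(":memory:")
conn.executescript(
    "CREATE TABLE singer (name TEXT, age INTEGER);"
    "INSERT INTO singer VALUES ('Ann', 30), ('Bob', 45);"
)
gold = "SELECT name FROM singer WHERE age > 40"
candidates = [
    "SELECT name FROM singer WHERE age >= 41",  # same result as gold
    "SELECT name FROM singer",                  # runs, but wrong result
    "SELECT nme FROM singer",                   # execution error
]
pairs = build_preference_pairs(conn, "Which singers are older than 40?", gold, candidates)
```

Records in this prompt/chosen/rejected shape are the kind of input a preference-learning step such as direct preference optimization consumes; the exact pairing strategy used for SENSE may differ from this sketch.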

Paper Structure (32 sections, 3 equations, 7 figures, 6 tables)

This paper contains 32 sections, 3 equations, 7 figures, 6 tables.

Figures (7)

  • Figure 1: Overview of Sense: Integrating human-annotated data with synthetic data from strong models for domain diversity, and weak models for preference learning, aligning with executors for enhanced text-to-SQL performance.
  • Figure 2: Unified prompt template (Chang and Fosler-Lussier, 2023) for text-to-SQL tasks.
  • Figure 3: Prompt for synthesizing strong data. The placeholder the_level is filled in on the fly by the program and controls the desired hardness level of the generated data point. To stay within the token limit, we randomly draw two examples from the Spider training set as few-shot demonstrations.
  • Figure 4: Domain density comparison. This visualization sorts domains by example count, showcasing a long-tail distribution to highlight the broad diversity within our synthetic dataset.
  • Figure 5: 2-D t-SNE visualization comparing the last-layer hidden representations (taken at the last token) of original and synthetic data after supervised fine-tuning.
  • ...and 2 more figures
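
The prompt-filling step described in the Figure 3 caption can be sketched roughly as follows. This is only an illustration: the template wording, the `HARDNESS_LEVELS` list (borrowed from Spider's usual easy/medium/hard/extra hard labels), the `build_prompt` helper, and the toy training examples are all assumptions, since the actual prompt text is not reproduced here. Only the placeholder name `the_level` and the two randomly drawn demonstrations come from the caption.

```python
import random

# Assumed hardness labels; the paper's exact level names may differ.
HARDNESS_LEVELS = ["easy", "medium", "hard", "extra hard"]

# Hypothetical template; only the `the_level` placeholder name is from the caption.
TEMPLATE = (
    "Here are two example data points:\n{demos}\n\n"
    "Write a new data point in a new domain with hardness level: {the_level}."
)


def build_prompt(train_examples, rng):
    """Fill the template with a random hardness level and two random demonstrations."""
    the_level = rng.choice(HARDNESS_LEVELS)
    demos = rng.sample(train_examples, 2)  # two few-shot examples, drawn without replacement
    demo_text = "\n".join(
        f"Question: {d['question']}\nSQL: {d['sql']}" for d in demos
    )
    return TEMPLATE.format(demos=demo_text, the_level=the_level)


# Toy stand-ins for Spider training examples.
train_examples = [
    {"question": "How many singers are there?", "sql": "SELECT count(*) FROM singer"},
    {"question": "List all stadium names.", "sql": "SELECT name FROM stadium"},
    {"question": "What is the average age?", "sql": "SELECT avg(age) FROM singer"},
]
prompt = build_prompt(train_examples, random.Random(0))
```

Passing an explicit `random.Random` instance keeps the sampling reproducible, which helps when auditing which demonstrations produced a given synthetic example.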