Improving Demonstration Diversity by Human-Free Fusing for Text-to-SQL

Dingzirui Wang, Longxu Dou, Xuanliang Zhang, Qingfu Zhu, Wanxiang Che

TL;DR

This work tackles the limited diversity and high labeling cost of demonstrations in text-to-SQL using in-context learning. It defines Diversity Measurement (DM) to quantify demonstration pool diversity and proposes Fused, a human-free, iterative synthesis method that samples demonstrations from clusters and fuses them with LLMs to produce highly diverse demonstrations. Empirical results on Spider and KaggleDBQA show average improvements of $3.2\%$ with existing labeling and $5.0\%$ without labeling, validating both the DM metric and the effectiveness of Fused. The approach demonstrates potential for reducing labeling overhead while enhancing cross-domain adaptability of LLM-driven text-to-SQL systems. The work also provides insights into how diversity, iteration, and synthesis scale impact performance, highlighting practical guidance for deploying diverse demonstrations in real-world scenarios.

Abstract

In-context learning with large language models (LLMs) has become the mainstream approach in text-to-SQL research. Previous works have discussed how to select demonstrations related to the user question from a human-labeled demonstration pool. However, human labeling suffers from insufficient diversity and high labeling overhead. Therefore, in this paper, we discuss how to measure and improve the diversity of the demonstrations for text-to-SQL. We present a metric to measure the diversity of the demonstrations and show through experiments that the existing labeled data is insufficiently diverse. Based on this finding, we propose fusing iteratively for demonstrations (Fused) to build a high-diversity demonstration pool through human-free multiple-iteration synthesis, improving diversity and lowering labeling cost. Our method achieves average improvements of 3.2% with human labeling and 5.0% without it on several mainstream datasets, which demonstrates the effectiveness of Fused.

Paper Structure

This paper contains 57 sections, 2 equations, 8 figures, 9 tables.

Figures (8)

  • Figure 1: The comparison between the baseline (left) and Fused (right) of obtaining the demonstration pool for text-to-SQL. Fused can synthesize the demonstration pool from scratch or enhance the diversity of the existing labeling without additional human involvement.
  • Figure 2: Two demonstration pools with different DM. $\bullet$ represents an encoded demonstration and ✖ represents an encoded user question; the darkest ✖ marks the user question that is least similar to its most similar demonstration. The Euclidean distance between each user question and its most similar demonstration is indicated next to each line.
  • Figure 3: The pipeline of Fused, which consists of two steps: (i) Demonstration Sample: sample the demonstrations to be fused from the demonstration pool; (ii) Demonstration Fuse: fuse the sampled demonstrations with a randomly sampled database. The representation of {database} is discussed in Appendix \ref{app:prompts}.
  • Figure 4: EX of $20$ different demonstration pools with different DM on the Spider dev set. Different points denote different pools containing $100$ demonstrations randomly sampled from the Spider train set.
  • Figure 5: DM and EX without values on the Spider dev set of CodeLlama-34b across different iterations with Fused. Turn $0$ denotes the original demonstration pool without Fused. The sizes of the demonstration pools can be seen in Appendix \ref{app:synthesis_number}.
  • ...and 3 more figures
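
The quantity pictured in Figure 2 can be illustrated with a small sketch. The paper's exact DM formula is not reproduced in this section, so the code below is only an assumption-laden illustration of the caption: for each encoded user question it takes the Euclidean distance to the nearest encoded demonstration, then reports the largest such distance (the "darkest ✖"). The function names and the toy 2-D vectors are hypothetical and stand in for real sentence embeddings.

```python
import math

def nearest_demo_distance(question, demos):
    """Euclidean distance from one encoded question to its closest demonstration."""
    return min(math.dist(question, d) for d in demos)

def worst_case_distance(questions, demos):
    """Largest nearest-demonstration distance over all questions.

    Hypothetical helper: a smaller value means every question has some
    demonstration close to it, i.e. the pool covers the question space better.
    This is NOT claimed to be the paper's DM definition.
    """
    return max(nearest_demo_distance(q, demos) for q in questions)

# Toy 2-D "embeddings" (assumed values, for illustration only).
questions = [(0.0, 0.0), (4.0, 0.0)]
clustered_pool = [(0.0, 1.0), (0.0, -1.0)]  # both demonstrations near the first question
spread_pool = [(0.0, 1.0), (4.0, 1.0)]      # one demonstration near each question

print(worst_case_distance(questions, clustered_pool))
print(worst_case_distance(questions, spread_pool))
```

In this toy setting the spread pool leaves no question far from a demonstration, while the clustered pool leaves the second question poorly covered, which mirrors the contrast between the two pools drawn in Figure 2.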