Solid-SQL: Enhanced Schema-linking based In-context Learning for Robust Text-to-SQL

Geling Liu, Yunzhi Tan, Ruichao Zhong, Yuanzhen Xie, Lingchen Zhao, Qian Wang, Bo Hu, Zang Li

TL;DR

Solid-SQL addresses the robustness gap in LLM-based text-to-SQL with a pre-processing pipeline that augments training data, refines schema linking, and retrieves in-context examples by skeleton similarity, combined with an explicit attention mechanism in the prompt. The method is designed as a plug-in that supports multiple LLMs and uses a two-round in-context learning process to stabilize SQL generation under perturbations. Generation is formalized as $S = M(Q,SC)$, and robustness requires $DB(M(Q,SC)) = DB(M(Q^*,SC))$: the SQL generated for a perturbed question $Q^*$ must execute to the same result as the SQL generated for the original question $Q$. Key contributions include robust data augmentation for schema linking, a fine-tuned schema-linking model, skeleton-based question and SQL matching strategies, and an attention-guided prompt design, all supported by extensive ablations. Empirically, Solid-SQL achieves state-of-the-art execution accuracy on general benchmarks ($EX$ of $82.1\%$ on Spider and $58.9\%$ on Bird) and yields an average robustness improvement of $11.6\%$ over baselines on perturbed datasets.
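The robustness condition $DB(M(Q,SC)) = DB(M(Q^*,SC))$ can be checked mechanically by executing both generated queries and comparing their results. The sketch below is illustrative only and is not taken from the paper: it assumes $DB(\cdot)$ means "execute on the database", uses an in-memory SQLite database with a made-up `singer` table, and replaces the model $M$ with hard-coded, hypothetical outputs.

```python
import sqlite3
from collections import Counter


def execute(conn, sql):
    """Run a query and return its rows as an order-insensitive multiset."""
    return Counter(conn.execute(sql).fetchall())


def is_robust(conn, sql_original, sql_perturbed):
    """True when both generated queries return the same rows on the database."""
    return execute(conn, sql_original) == execute(conn, sql_perturbed)


# Toy database (hypothetical schema and rows, for illustration only).
conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE singer (id INTEGER PRIMARY KEY, name TEXT, age INTEGER);
    INSERT INTO singer VALUES (1, 'Ann', 31), (2, 'Bo', 24), (3, 'Cy', 45);
""")

# Hypothetical model outputs: M(Q, SC), a consistent M(Q*, SC), and a
# drifted M(Q*, SC) that no longer answers the same question.
sql_q = "SELECT name FROM singer WHERE age > 30"
sql_q_star = "SELECT s.name FROM singer AS s WHERE s.age >= 31"
sql_drifted = "SELECT name FROM singer WHERE age < 30"

print(is_robust(conn, sql_q, sql_q_star))
print(is_robust(conn, sql_q, sql_drifted))
```

Comparing result multisets rather than SQL strings matters here: the two consistent queries differ textually but return the same rows, which is what the condition asks for.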

Abstract

Recently, large language models (LLMs) have significantly improved the performance of text-to-SQL systems. Nevertheless, many state-of-the-art (SOTA) approaches have overlooked the critical aspect of system robustness. Our experiments reveal that while LLM-driven methods excel on standard datasets, their accuracy is notably compromised when faced with adversarial perturbations. To address this challenge, we propose a robust text-to-SQL solution, called Solid-SQL, designed to integrate with various LLMs. We focus on the pre-processing stage, training a robust schema-linking model enhanced by LLM-based data augmentation. Additionally, we design a two-round, structural similarity-based example retrieval strategy for in-context learning. Our method achieves SOTA SQL execution accuracy levels of 82.1% and 58.9% on the general Spider and Bird benchmarks, respectively. Furthermore, experimental results show that Solid-SQL delivers an average improvement of 11.6% compared to baselines on the perturbed Spider-Syn, Spider-Realistic, and Dr. Spider benchmarks.
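To make the idea of structural-similarity retrieval concrete, here is a minimal sketch of one plausible reading: mask schema-specific tokens in a draft SQL query to obtain a "skeleton", then rank stored examples by skeleton similarity. Everything in it is an assumption rather than the paper's procedure: the keyword list, the `_` placeholder, the use of `difflib.SequenceMatcher` as the similarity measure, and the example pool are all hypothetical.

```python
import re
from difflib import SequenceMatcher

# Small, non-exhaustive keyword list (assumption for this sketch).
SQL_KEYWORDS = {
    "select", "from", "where", "group", "by", "order", "having", "join",
    "on", "and", "or", "not", "in", "as", "count", "sum", "avg", "min",
    "max", "distinct", "limit", "desc", "asc", "like", "between",
}


def sql_skeleton(sql):
    """Keep keywords and punctuation; mask identifiers and literals as '_'."""
    tokens = re.findall(r"'[^']*'|\w+|[^\w\s]", sql)
    skeleton = []
    for tok in tokens:
        if tok.lower() in SQL_KEYWORDS:
            skeleton.append(tok.upper())
        elif re.fullmatch(r"\w+|'[^']*'", tok):
            skeleton.append("_")
        else:
            skeleton.append(tok)
    return skeleton


def retrieve(draft_sql, pool, k=1):
    """Return the k (question, sql) examples whose skeleton best matches the draft."""
    target = sql_skeleton(draft_sql)

    def score(example):
        return SequenceMatcher(None, target, sql_skeleton(example[1])).ratio()

    return sorted(pool, key=score, reverse=True)[:k]


# Hypothetical example pool of (question, SQL) pairs.
pool = [
    ("How many singers are there?", "SELECT count(*) FROM singer"),
    ("List names of students older than 20",
     "SELECT name FROM student WHERE age > 20"),
    ("Average salary per department",
     "SELECT dept, avg(salary) FROM employee GROUP BY dept"),
]

# A first-round draft query; the second round would reuse the retrieved
# examples as in-context demonstrations.
draft = "SELECT title FROM movie WHERE year > 2000"
print(sql_skeleton(draft))
print(retrieve(draft, pool, k=1))
```

Because table names, column names, and values are masked, the draft about movies matches the example about students: both share the same query shape, which is the property a skeleton-based retriever is meant to exploit.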

Paper Structure

This paper contains 28 sections, 2 figures, 6 tables.

Figures (2)

  • Figure 1: The general three-stage pipeline of LLM-based text-to-SQL systems.
  • Figure 2: The pipeline of Solid-SQL.