Structure-Grounded Pretraining for Text-to-SQL

Xiang Deng; Ahmed Hassan Awadallah; Christopher Meek; Oleksandr Polozov; Huan Sun; Matthew Richardson

TL;DR

StruG is a structure-grounded pretraining framework for text-to-SQL that uses a parallel text-table corpus to learn text-table alignment through three objectives: column grounding, value grounding, and column-value mapping. It is weakly supervised on ToTTo under two labeling settings (human-assisted and automatic), with additional data augmentation, and trains a BERT-style encoder that transfers to downstream text-to-SQL models such as RAT-SQL. StruG improves significantly over BERT-LARGE, performs similarly to GRAPPA on Spider, and outperforms the baselines on the more realistic Spider-Realistic evaluation set. It is also data-efficient, reaching strong performance on WikiSQL with limited annotated data, and it reduces supervision cost by exploiting large parallel corpora. By learning text-table grounding explicitly, StruG improves generalization when column names are not explicitly mentioned in the question, and it offers a scalable pretraining approach that is complementary to existing ones.

Abstract

Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks: column grounding, value grounding, and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set, Spider-Realistic, based on the Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. StruG brings significant improvement over BERT-LARGE in all settings. Compared with existing pretraining methods such as GRAPPA, StruG achieves similar performance on Spider and outperforms all baselines on more realistic sets. The Spider-Realistic dataset is available at https://doi.org/10.5281/zenodo.5205322.
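To make the three prediction tasks concrete, the sketch below derives toy weak labels for a single sentence-table pair. It is an illustration only, not the authors' labeling procedure: the exact, case-insensitive token matching rule, the rule that a column counts as grounded when one of its cell values appears in the sentence, and the names WeakLabels and weak_labels are all assumptions introduced here.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class WeakLabels:
    column_grounding: Dict[str, bool]     # column name -> is the column referenced?
    value_grounding: List[bool]           # per token: is it part of a cell value?
    column_value_mapping: Dict[int, str]  # token index -> column of the matched value


def weak_labels(tokens: List[str], table: Dict[str, List[str]]) -> WeakLabels:
    """Toy weak labeling by exact, case-insensitive token matching.

    `table` maps each column name to its cell strings. The matching rule is a
    hypothetical stand-in, not the procedure used in the paper.
    """
    lowered = [t.lower() for t in tokens]
    column_grounding = {col: False for col in table}
    value_grounding = [False] * len(tokens)
    column_value_mapping: Dict[int, str] = {}

    for col, cells in table.items():
        for cell in cells:
            cell_tokens = cell.lower().split()
            n = len(cell_tokens)
            if n == 0:
                continue
            # Slide a window over the sentence looking for the cell value.
            for start in range(len(lowered) - n + 1):
                if lowered[start:start + n] == cell_tokens:
                    column_grounding[col] = True  # assumption: a value mention grounds its column
                    for i in range(start, start + n):
                        value_grounding[i] = True
                        column_value_mapping[i] = col

    return WeakLabels(column_grounding, value_grounding, column_value_mapping)


# Made-up example: the question never names the "director" or "year" columns.
tokens = "Which films did Nolan direct in 2010".split()
table = {
    "title": ["Inception", "Arrival"],
    "director": ["Nolan", "Villeneuve"],
    "year": ["2010", "2016"],
}
labels = weak_labels(tokens, table)
```

In this made-up example the question mentions no column name, yet the labels still tie "Nolan" to the director column and "2010" to the year column, which is the kind of alignment that matters when explicit column mentions are removed, as in Spider-Realistic.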
