Structure-Grounded Pretraining for Text-to-SQL

Xiang Deng, Ahmed Hassan Awadallah, Christopher Meek, Oleksandr Polozov, Huan Sun, Matthew Richardson

TL;DR

StruG is a structure-grounded pretraining framework for text-to-SQL that learns text-table alignment from a parallel text-table corpus through three objectives: column grounding, value grounding, and column-value mapping. It is pretrained with weak supervision on ToTTo under two labeling settings (human-assisted and automatic), combined with data augmentation. The resulting BERT-style encoder transfers to downstream text-to-SQL models such as RAT-SQL, where it improves significantly over BERT-LARGE, performs comparably to GRAPPA on Spider, and outperforms the baselines on more realistic sets such as Spider-Realistic. The method is also data-efficient, reaching strong performance on WikiSQL with limited annotated data, and it reduces supervision costs by exploiting large parallel corpora. Overall, StruG advances practical cross-database text-to-SQL by explicitly learning text-table grounding, which improves generalization when column names are not explicitly mentioned, and it offers a scalable pretraining paradigm complementary to existing approaches.

Abstract

Learning to capture text-table alignment is essential for tasks like text-to-SQL. A model needs to correctly recognize natural language references to columns and values and to ground them in the given database schema. In this paper, we present a novel weakly supervised Structure-Grounded pretraining framework (StruG) for text-to-SQL that can effectively learn to capture text-table alignment based on a parallel text-table corpus. We identify a set of novel prediction tasks: column grounding, value grounding and column-value mapping, and leverage them to pretrain a text-table encoder. Additionally, to evaluate different methods under more realistic text-table alignment settings, we create a new evaluation set, Spider-Realistic, based on the Spider dev set with explicit mentions of column names removed, and adopt eight existing text-to-SQL datasets for cross-database evaluation. StruG brings significant improvement over BERT-LARGE in all settings. Compared with existing pretraining methods such as GRAPPA, StruG achieves similar performance on Spider, and outperforms all baselines on more realistic sets. The Spider-Realistic dataset is available at https://doi.org/10.5281/zenodo.5205322.
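
To make the three prediction tasks concrete, the following is a minimal sketch of how weak labels might be derived from a sentence-table pair by exact string matching, in the spirit of the automatic labeling setting. It is an illustration under stated assumptions, not the authors' implementation: the function name `weak_labels`, the whitespace tokenization, the dictionary table layout, and the example table are all hypothetical.

```python
from typing import Dict, List, Optional, Tuple


def weak_labels(
    utterance: str, table: Dict[str, List[str]]
) -> Tuple[List[str], Dict[str, int], List[int], List[Optional[str]]]:
    """Derive weak grounding labels by exact string matching (illustrative only).

    `table` maps each column name to its list of cell strings.
    Returns the tokens plus one label set per pretraining task:
      - column grounding: 1 if any cell of the column appears in the utterance
      - value grounding: 1 for each token that is part of a matched cell
      - column-value mapping: the column each matched token is assigned to
    """
    # Assumption: lowercase whitespace tokenization stands in for a real tokenizer.
    tokens = utterance.lower().split()
    column_labels = {col: 0 for col in table}
    value_labels = [0] * len(tokens)
    token_to_column: List[Optional[str]] = [None] * len(tokens)

    for col, cells in table.items():
        for cell in cells:
            cell_tokens = cell.lower().split()
            n = len(cell_tokens)
            if n == 0:
                continue
            # Slide a window over the utterance looking for the cell text.
            for start in range(len(tokens) - n + 1):
                if tokens[start:start + n] == cell_tokens:
                    column_labels[col] = 1
                    for i in range(start, start + n):
                        value_labels[i] = 1
                        # Keep the first column that claims a token.
                        if token_to_column[i] is None:
                            token_to_column[i] = col
    return tokens, column_labels, value_labels, token_to_column


# Hypothetical example table and sentence.
example_table = {
    "Year": ["2008", "2009"],
    "Champion": ["Daniel Nestor", "Bob Bryan"],
    "Country": ["Canada", "United States"],
}
tokens, cols, vals, mapping = weak_labels(
    "Daniel Nestor won the title in 2008", example_table
)
```

In this example the sentence never names the "Champion" or "Year" columns, yet both receive a positive column-grounding label because one of their cell values is mentioned. This is the kind of implicit reference that Spider-Realistic is designed to test.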

Paper Structure

This paper contains 18 sections, 1 equation, 7 figures, 10 tables.

Figures (7)

  • Figure 1: Illustration of text-to-SQL text-table alignment (top half) and parallel text-table corpus (bottom half). In both examples, the associations between tokens in the NL utterance and columns in the table are indicated. In this paper, we aim to leverage the text-table alignment knowledge in the parallel text-table corpus to help text-to-SQL.
  • Figure 2: Overview of our model architecture and three pretraining objectives.
  • Figure 3: Illustration of the parallel corpus ToTTo (Parikh et al., 2020) and our two weakly supervised pretraining settings. Cells highlighted in yellow are the cell annotations provided by ToTTo, and cells outlined with dashed lines are cell annotations obtained via string matching in the automatic setting.
  • Figure 4: Execution Accuracy on the WikiSQL test set with different fractions of training data.
  • Figure 5: Execution Accuracy on the WikiSQL dev set during training with 5% of training data.
  • ...and 2 more figures