
SQL-Encoder: Improving NL2SQL In-Context Learning Through a Context-Aware Encoder

Mohammadreza Pourreza, Davood Rafiei, Yuxi Feng, Raymond Li, Zhenan Fan, Weiwei Zhang

TL;DR

This paper uses a dataset comprising 170k question pairs, meticulously curated to train a similarity prediction model, and demonstrates that the proposed model adeptly captures the structural similarity between questions, as evidenced by improvements in Kendall-Tau distance and precision@k metrics.

Abstract

Detecting structural similarity between queries is essential for selecting examples in in-context learning models. However, assessing structural similarity based solely on the natural language expressions of queries, without considering SQL queries, presents a significant challenge. This paper explores the significance of this similarity metric and proposes a model for accurately estimating it. To achieve this, we leverage a dataset comprising 170k question pairs, meticulously curated to train a similarity prediction model. Our comprehensive evaluation demonstrates that the proposed model adeptly captures the structural similarity between questions, as evidenced by improvements in Kendall-Tau distance and precision@k metrics. Notably, our model outperforms strong competitive embedding models from OpenAI and Cohere. Furthermore, compared to these competitive models, our proposed encoder enhances the downstream performance of NL2SQL models in 1-shot in-context learning scenarios by 1-2% for GPT-3.5-turbo, 4-8% for CodeLlama-7B, and 2-3% for CodeLlama-13B.
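As a rough illustration of the two evaluation metrics named in the abstract, the sketch below compares a predicted ranking of candidate questions against a reference ranking. The function names, the toy rankings, and the exact definitions (a normalized Kendall-Tau distance over item pairs, and precision@k as overlap between the two top-k sets) are assumptions made for illustration; the paper's own formulations may differ.

```python
from itertools import combinations


def kendall_tau_distance(rank_a, rank_b):
    """Fraction of item pairs that the two rankings order differently.

    Assumes both rankings contain the same items with no ties.
    """
    pos_a = {item: i for i, item in enumerate(rank_a)}
    pos_b = {item: i for i, item in enumerate(rank_b)}
    pairs = list(combinations(rank_a, 2))
    discordant = sum(
        1
        for x, y in pairs
        if (pos_a[x] - pos_a[y]) * (pos_b[x] - pos_b[y]) < 0
    )
    return discordant / len(pairs)


def precision_at_k(predicted, reference, k):
    """Share of the top-k predicted items that are also in the reference top-k."""
    return len(set(predicted[:k]) & set(reference[:k])) / k


# Hypothetical rankings of candidate questions, most similar first.
reference = ["q1", "q2", "q3", "q4"]  # ordering by the true structural similarity
predicted = ["q2", "q1", "q3", "q4"]  # ordering produced by an encoder

tau = kendall_tau_distance(reference, predicted)
p_at_1 = precision_at_k(predicted, reference, 1)
p_at_2 = precision_at_k(predicted, reference, 2)
```

A lower Kendall-Tau distance means the predicted ordering agrees more closely with the reference ordering, while a higher precision@k means more of the truly similar examples land among the top k retrieved.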

Paper Structure

This paper contains 21 sections, 1 equation, 3 figures, and 7 tables.

Figures (3)

  • Figure 1: Overview of the SQL-encoder framework for predicting the similarity between two questions $N_1$ and $N_2$ on a given database schema $D$, where SIM' serves as a proxy for the actual similarity SIM between the questions $N_1$ and $N_2$, their respective SQL queries $Q_1$ and $Q_2$, and schema links $S_1$ and $S_2$.
  • Figure 2: An example demonstrating the utilization of different similarity metrics to find the most similar Question/SQL pair.
  • Figure 3: An example of the process to construct a tree from SQL query after masking the schema mentions.