Learning Metadata-Agnostic Representations for Text-to-SQL In-Context Example Selection

Chuhong Mai, Ro-ee Tal, Thahir Mohamed

TL;DR

The proposed technique, dubbed MARLO - Metadata-Agnostic Representation Learning for Text-tO-SQL - uses query structure to model querying intent without over-indexing on underlying database metadata.

Abstract

In-context learning (ICL) is a powerful paradigm where large language models (LLMs) benefit from task demonstrations added to the prompt. Yet, selecting optimal demonstrations is not trivial, especially for complex or multi-modal tasks where input and output distributions differ. We hypothesize that forming task-specific representations of the input is key. In this paper, we propose a method to align representations of natural language questions and those of SQL queries in a shared embedding space. Our technique, dubbed MARLO - Metadata-Agnostic Representation Learning for Text-tO-SQL - uses query structure to model querying intent without over-indexing on underlying database metadata (i.e., tables, columns, or domain-specific entities of a database referenced in the question or query). This allows MARLO to select examples that are structurally and semantically relevant for the task rather than examples that are spuriously related to a certain domain or question phrasing. When used to retrieve examples based on question similarity, MARLO shows superior performance compared to generic embedding models (on average +2.9 percentage points in execution accuracy) on the Spider benchmark. It also outperforms the next best method that masks metadata information by +0.8 percentage points in execution accuracy on average, while incurring significantly lower inference latency.
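
The retrieval step mentioned in the abstract (selecting demonstrations by question similarity) can be illustrated with a minimal sketch. Everything concrete here is an assumption made for illustration: the function names `cosine` and `select_demonstrations`, the toy questions and SQL strings, and the hand-made 3-dimensional vectors, which merely stand in for embeddings that a trained encoder would produce. It is not the authors' implementation.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length, non-zero vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def select_demonstrations(query_vec, pool, k=2):
    """Return the k pool entries whose question embedding is closest to query_vec."""
    ranked = sorted(
        pool,
        key=lambda ex: cosine(query_vec, ex["embedding"]),
        reverse=True,
    )
    return ranked[:k]


# Hypothetical demonstration pool. The vectors are hand-made so that the two
# counting questions lie close together regardless of their database domain.
pool = [
    {
        "question": "How many singers are there?",
        "sql": "SELECT count(*) FROM singer",
        "embedding": [0.9, 0.1, 0.0],
    },
    {
        "question": "List the names of all singers.",
        "sql": "SELECT name FROM singer",
        "embedding": [0.1, 0.9, 0.0],
    },
    {
        "question": "How many flights depart from each airport?",
        "sql": "SELECT source_airport, count(*) FROM flights GROUP BY source_airport",
        "embedding": [0.7, 0.0, 0.7],
    },
]

# Hypothetical embedding of a new counting question from an unrelated domain.
query_vec = [1.0, 0.0, 0.2]
demos = select_demonstrations(query_vec, pool, k=2)
for demo in demos:
    print(demo["question"], "->", demo["sql"])
```

With these toy vectors, the two counting questions rank above the listing question even though the latter shares the `singer` table with one of them. This is the kind of structure-over-metadata behavior the abstract describes, shown here only by construction of the vectors.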

Paper Structure

This paper contains 46 sections, 3 equations, 6 figures, 11 tables, 1 algorithm.

Figures (6)

  • Figure 1: Motivation of this work. From the perspective of generic sentence embeddings, the left question is similar to the middle one but dissimilar from the one on the right. MARLO focuses on query structure (rather than metadata specifics) to represent the intent of each question more accurately. This allows it to retrieve the more instructive demonstration (rightmost). For emphasis, noun chunks, parts-of-speech, and domain information specific to the database metadata are annotated accordingly.
  • Figure 2: Semi-asymmetric bi-encoder architecture. Parameters in the base transformer and the pooling layer are shared while two separate dense layers are trained to align question and SQL query embeddings, respectively.
  • Figure 3: Execution accuracy (%) of MARLO for various numbers of selected demonstrations. Performance initially increases and then plateaus as more demonstrations are included in the context, implying in-context learning scaling limitations for this task.
  • Figure 4: Architecture choices for bi-encoders. A symmetric architecture (a) has parameters shared across all three modules, while an asymmetric architecture (c) does not share any layer between the two towers. Our work adopts a semi-asymmetric structure in which a common backbone transformer and pooling layer are shared, but the dense layers are separate.
  • Figure 5: Execution accuracy (%) on Spider-dev by difficulty level. With Claude 2.1 and Mistral Large, MARLO outperforms all other demonstration selection methods across all difficulty levels, particularly on more difficult questions, implying that the examples it selects contribute to better question understanding in more complex settings.
  • ...and 1 more figure
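
The semi-asymmetric layout described in the Figure 2 and Figure 4 captions (shared base encoder and pooling layer, separate dense layers for questions and SQL queries) can be sketched in a few lines of plain Python. This is a toy under stated assumptions, not the paper's model: a random token-embedding table stands in for the base transformer, mean pooling is assumed as the pooling layer, the weights are random and untrained, and the class name `SemiAsymmetricBiEncoder`, the dimension `DIM`, and the vocabulary are all hypothetical.

```python
import random

DIM = 4  # toy hidden size (assumption)
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def rand_matrix(rows, cols):
    """Random weight matrix standing in for a trained dense layer."""
    return [[rng.uniform(-1, 1) for _ in range(cols)] for _ in range(rows)]


def matvec(matrix, vec):
    """Multiply a matrix (list of rows) by a vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]


class SemiAsymmetricBiEncoder:
    def __init__(self, vocab):
        # Shared by both towers: a toy embedding table in place of the
        # base transformer.
        self.token_emb = {
            tok: [rng.uniform(-1, 1) for _ in range(DIM)] for tok in vocab
        }
        # Not shared: one dense layer per input type.
        self.dense_question = rand_matrix(DIM, DIM)
        self.dense_sql = rand_matrix(DIM, DIM)

    def _backbone_and_pool(self, tokens):
        """Shared path: embed each token, then mean-pool over tokens."""
        vecs = [self.token_emb[t] for t in tokens]
        return [sum(col) / len(vecs) for col in zip(*vecs)]

    def encode_question(self, tokens):
        return matvec(self.dense_question, self._backbone_and_pool(tokens))

    def encode_sql(self, tokens):
        return matvec(self.dense_sql, self._backbone_and_pool(tokens))


encoder = SemiAsymmetricBiEncoder(vocab=["select", "count", "from", "how", "many"])
question_vec = encoder.encode_question(["how", "many"])
sql_vec = encoder.encode_sql(["select", "count", "from"])
print(len(question_vec), len(sql_vec))
```

Both towers pass through the same `_backbone_and_pool` path and differ only in the final dense layer, which is the property that separates the semi-asymmetric design from the fully symmetric and fully asymmetric alternatives in the Figure 4 caption.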