Synthetic SQL Column Descriptions and Their Impact on Text-to-SQL Performance

Niklas Wretblad, Oskar Holmström, Erik Larsson, Axel Wiksäter, Oscar Söderlund, Hjalmar Öhman, Ture Pontén, Martin Forsberg, Martin Sörme, Fredrik Heintz

TL;DR

Incorporating LLM-generated column descriptions consistently enhances text-to-SQL model performance, particularly for larger models like GPT-4o, Qwen2 72B and Mixtral 22Bx8, suggesting that models benefit from more detailed metadata than humans expect.

Abstract

Relational databases often suffer from uninformative descriptors of table contents, such as ambiguous columns and hard-to-interpret values, which affects both human users and text-to-SQL models. In this paper, we explore the use of large language models (LLMs) to automatically generate detailed natural language descriptions for SQL database columns, aiming to improve text-to-SQL performance and automate metadata creation. We create a dataset of gold column descriptions based on the BIRD-Bench benchmark, manually refining its column descriptions and creating a taxonomy for categorizing column difficulty. We then evaluate several LLMs on generating column descriptions across the columns and difficulty levels in the dataset, finding that models unsurprisingly struggle with columns that exhibit inherent ambiguity, which highlights the need for manual expert input. We also find that incorporating such generated column descriptions consistently enhances text-to-SQL model performance, particularly for larger models like GPT-4o, Qwen2 72B and Mixtral 22Bx8. Notably, Qwen2-generated descriptions, which contain information that annotators deemed superfluous, outperform manually curated gold descriptions, suggesting that models benefit from more detailed metadata than humans expect. Future work will investigate the specific features of these high-performing descriptions and explore other types of metadata, such as numerical reasoning and synonyms, to further improve text-to-SQL systems. The dataset, annotations and code will all be made available.
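
To make the idea of incorporating column descriptions into text-to-SQL concrete, the following is a minimal sketch of one way descriptions could be attached to a schema before it is passed to a model. It is not the paper's prompt format: the `Column` class, the `render_schema` and `build_prompt` helpers, and the choice to encode descriptions as SQL comments are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class Column:
    """One schema column; `description` is empty when no metadata exists."""
    name: str
    sql_type: str
    description: str = ""


def render_schema(table: str, columns: list[Column]) -> str:
    """Render a CREATE TABLE statement with descriptions as trailing SQL comments."""
    lines = []
    for i, col in enumerate(columns):
        # Every column except the last is followed by a comma.
        sep = "," if i < len(columns) - 1 else ""
        # Only columns that have a description get a comment.
        comment = f"  -- {col.description}" if col.description else ""
        lines.append(f"    {col.name} {col.sql_type}{sep}{comment}")
    body = "\n".join(lines)
    return f"CREATE TABLE {table} (\n{body}\n);"


def build_prompt(schema: str, question: str) -> str:
    """Combine the annotated schema and a question (hypothetical prompt layout)."""
    return f"{schema}\n\n-- Question: {question}\n-- SQL:"
```

Calling `render_schema("district", [Column("A2", "TEXT", "district name")])` would yield a schema in which the opaque column name `A2` is followed by a comment carrying its description; the description text here is a placeholder, not taken from the dataset.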

Paper Structure (81 sections, 7 figures, 13 tables)

Figures (7)

  • Figure 1: Example rows from the district table in the BIRD-Bench benchmark dataset. This table illustrates a typical schema with uninformative column names such as A2-A16, making interpretation difficult without external documentation or domain knowledge. The column names provide little semantic meaning, and the accompanying data provide varying amounts of information, which complicates their use in text-to-SQL query generation and requires additional metadata for effective database interactions.
  • Figure 2: Visual representation of LLM performance in generating SQL column descriptions.
  • Figure 3: Execution accuracy with and without gold column descriptions on completely uninformative column names.
  • Figure 4: A decision tree designed to help annotators in deciding the quality of the generated column descriptions.
  • Figure 5: The prompt used for generating column descriptions, given only the schema and sample data from the database.
  • ...and 2 more figures
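
Figure 3 reports execution accuracy. As a rough sketch of how such a metric is commonly computed, the code below runs a predicted query and a gold query against the same SQLite database and counts a match when their result sets agree. The function names, the set-based comparison, and the treatment of failing queries as misses are assumptions for illustration; the paper's own evaluation script may differ.

```python
import sqlite3


def execution_match(conn: sqlite3.Connection, predicted_sql: str, gold_sql: str) -> bool:
    """Return True when both queries produce the same set of rows."""
    try:
        predicted_rows = conn.execute(predicted_sql).fetchall()
    except sqlite3.Error:
        # A predicted query that fails to execute counts as a miss (assumption).
        return False
    gold_rows = conn.execute(gold_sql).fetchall()
    # Order-insensitive comparison; duplicates are ignored (assumption).
    return set(predicted_rows) == set(gold_rows)


def execution_accuracy(conn: sqlite3.Connection, pairs: list[tuple[str, str]]) -> float:
    """Fraction of (predicted, gold) query pairs whose results match."""
    if not pairs:
        return 0.0
    hits = sum(execution_match(conn, pred, gold) for pred, gold in pairs)
    return hits / len(pairs)
```

A set comparison treats row order and duplicate rows as irrelevant, which is lenient for queries with `ORDER BY`; a stricter variant could compare the row lists directly.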