Automatic database description generation for Text-to-SQL

Yingqi Gao, Zhiling Luo

TL;DR

This work addresses the cold-start problem in NL2SQL by automatically generating table and column descriptions when explicit metadata is unavailable. It introduces a dual-process framework that runs a coarse-to-fine pass guided by LLM knowledge, then a fine-to-coarse pass, and integrates both to produce coherent $db\_info$, $table\_info$, $column\_info$, and $column\_relation$ that enrich the M-Schema. Columns are classified as Code, Enum, DateTime, or Text, and descriptions are length-constrained (fewer than 20 words per column, fewer than 100 words per table) to support more accurate schema linking and SQL generation. Experimental results on the Bird benchmark show an average improvement of about $0.93\%$ over using no descriptions and about $37\%$ of human-level performance, highlighting practical benefits for NL2SQL systems. The authors release their code publicly, with support for SQLite, MySQL, and PostgreSQL, which eases integration into downstream NL2SQL pipelines.
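To make the four column categories concrete, the following is a minimal sketch of how sampled values might be sorted into Code, Enum, DateTime, and Text. It is not the authors' classifier: the heuristics, the thresholds (`enum_max_distinct`, the 32-character cap), the date formats, and the helper names `classify_column` and `sample_column` are all assumptions made for illustration.

```python
import sqlite3
from datetime import datetime

# Assumed date formats; a real system would likely need many more.
DATETIME_FORMATS = ("%Y-%m-%d", "%Y-%m-%d %H:%M:%S")


def looks_like_datetime(value: str) -> bool:
    """Return True if the string parses under one of the assumed formats."""
    for fmt in DATETIME_FORMATS:
        try:
            datetime.strptime(value, fmt)
            return True
        except ValueError:
            continue
    return False


def classify_column(values, enum_max_distinct=10):
    """Heuristically assign one of the four categories to sampled values."""
    non_null = [str(v) for v in values if v is not None]
    if not non_null:
        return "Text"  # nothing to inspect, fall back to the most general label
    if all(looks_like_datetime(v) for v in non_null):
        return "DateTime"
    distinct = set(non_null)
    # Few distinct values that repeat across rows suggest an enumeration.
    if len(distinct) <= enum_max_distinct and len(distinct) < len(non_null):
        return "Enum"
    # Short tokens without spaces suggest identifiers or codes.
    if all(" " not in v and len(v) <= 32 for v in non_null):
        return "Code"
    return "Text"


def sample_column(conn, table, column, limit=100):
    """Fetch up to `limit` values; identifiers are assumed to come from the schema."""
    query = f'SELECT "{column}" FROM "{table}" LIMIT ?'
    return [row[0] for row in conn.execute(query, (limit,)).fetchall()]
```

A category assigned this way could then decide what is passed to the description prompt, for example the full value list for an Enum column and only a few samples for a Text column.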

Abstract

In the context of the Text-to-SQL task, table and column descriptions are crucial for bridging the gap between natural language and database schema. This report proposes a method for automatically generating effective database descriptions when explicit descriptions are unavailable. The proposed method employs a dual-process approach: a coarse-to-fine process, followed by a fine-to-coarse process. The coarse-to-fine approach leverages the inherent knowledge of the LLM to guide the understanding process from databases to tables and finally to columns. This approach provides a holistic understanding of the database structure and ensures contextual alignment. Conversely, the fine-to-coarse approach starts at the column level, offering a more accurate and nuanced understanding when stepping back to the table level. Experimental results on the Bird benchmark indicate that using descriptions generated by the proposed method improves SQL generation accuracy by 0.93% compared to not using descriptions, and achieves 37% of human-level performance. The source code is publicly available at https://github.com/XGenerationLab/XiYan-DBDescGen.
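The ordering of the two passes can be sketched as follows. This is a simplified illustration under stated assumptions, not the released implementation: the prompts, the `describe_database` function, the `llm` callable, and the truncation-based length limit are hypothetical, and the integration of the two passes is reduced here to a single table-level summarization step.

```python
from typing import Callable, Dict, List

LLM = Callable[[str], str]  # hypothetical interface: prompt in, text out

COLUMN_WORD_LIMIT = 20   # column descriptions stay under 20 words
TABLE_WORD_LIMIT = 100   # table descriptions stay under 100 words


def truncate_words(text: str, limit: int) -> str:
    """Keep strictly fewer than `limit` words."""
    return " ".join(text.split()[: limit - 1])


def describe_database(schema: Dict[str, List[str]], llm: LLM) -> dict:
    # Coarse-to-fine: database -> tables -> columns.
    db_info = llm(f"Describe the database with tables: {', '.join(schema)}")
    coarse_tables = {
        table: llm(f"Given the database summary '{db_info}', describe table {table}")
        for table in schema
    }
    column_info = {}
    for table, columns in schema.items():
        for column in columns:
            prompt = (
                f"Given the table summary '{coarse_tables[table]}', "
                f"describe column {table}.{column}"
            )
            column_info[f"{table}.{column}"] = truncate_words(
                llm(prompt), COLUMN_WORD_LIMIT
            )

    # Fine-to-coarse: step back from column descriptions to the table level.
    table_info = {}
    for table, columns in schema.items():
        column_text = "; ".join(column_info[f"{table}.{c}"] for c in columns)
        table_info[table] = truncate_words(
            llm(f"Summarize table {table} from its columns: {column_text}"),
            TABLE_WORD_LIMIT,
        )
    return {"db_info": db_info, "table_info": table_info, "column_info": column_info}
```

Passing the LLM in as a plain callable keeps the sketch independent of any particular model API and makes the call order easy to inspect with a stub.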

Paper Structure

This paper contains 12 sections, 2 figures, 1 table.

Figures (2)

  • Figure 1: The workflow of the proposed database description generation method.
  • Figure 2: The impact of different database descriptions on SQL generation.