Table of Contents
Fetching ...

DBCopilot: Natural Language Querying over Massive Databases via Schema Routing

Tianshu Wang, Xiaoyang Chen, Hongyu Lin, Xianpei Han, Le Sun, Hao Wang, Zhenyu Zeng

TL;DR

DBCopilot tackles natural language querying over massive databases by decoupling schema routing from SQL generation. It introduces a compact copilot router built on a differentiable search index to map NL questions to target databases and tables, and uses a relation-aware, DFS-serialized schema graph to enable end-to-end routing. A reverse schema-to-question data synthesis pipeline trains the router without manual labeling, while various prompt strategies guide LLM-based SQL generation from routed schemata. Experimental results show significant gains in routing recall and end-to-end execution accuracy, demonstrating scalable NL2SQL performance on large, real-world schemata with minimal manual intervention.

Abstract

The development of Natural Language Interfaces to Databases (NLIDBs) has been greatly advanced by the advent of large language models (LLMs), which provide an intuitive way to translate natural language (NL) questions into Structured Query Language (SQL) queries. While significant progress has been made in LLM-based NL2SQL, existing approaches face several challenges in real-world scenarios of natural language querying over massive databases. In this paper, we present DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing over massive databases. Specifically, DBCopilot decouples schema-agnostic NL2SQL into schema routing and SQL generation. This framework utilizes a single lightweight differentiable search index to construct semantic mappings for massive database schemata, and navigates natural language questions to their target databases and tables in a relation-aware joint retrieval manner. The routed schemata and questions are then fed into LLMs for effective SQL generation. Furthermore, DBCopilot introduces a reverse schema-to-question generation paradigm that can automatically learn and adapt the router over massive databases without manual intervention. Experimental results verify that DBCopilot is a scalable and effective solution for schema-agnostic NL2SQL, providing a significant advance in handling natural language querying over massive databases for NLIDBs.

DBCopilot: Natural Language Querying over Massive Databases via Schema Routing

TL;DR

DBCopilot tackles natural language querying over massive databases by decoupling schema routing from SQL generation. It introduces a compact copilot router built on a differentiable search index to map NL questions to target databases and tables, and uses a relation-aware, DFS-serialized schema graph to enable end-to-end routing. A reverse schema-to-question data synthesis pipeline trains the router without manual labeling, while various prompt strategies guide LLM-based SQL generation from routed schemata. Experimental results show significant gains in routing recall and end-to-end execution accuracy, demonstrating scalable NL2SQL performance on large, real-world schemata with minimal manual intervention.

Abstract

The development of Natural Language Interfaces to Databases (NLIDBs) has been greatly advanced by the advent of large language models (LLMs), which provide an intuitive way to translate natural language (NL) questions into Structured Query Language (SQL) queries. While significant progress has been made in LLM-based NL2SQL, existing approaches face several challenges in real-world scenarios of natural language querying over massive databases. In this paper, we present DBCopilot, a framework that addresses these challenges by employing a compact and flexible copilot model for routing over massive databases. Specifically, DBCopilot decouples schema-agnostic NL2SQL into schema routing and SQL generation. This framework utilizes a single lightweight differentiable search index to construct semantic mappings for massive database schemata, and navigates natural language questions to their target databases and tables in a relation-aware joint retrieval manner. The routed schemata and questions are then fed into LLMs for effective SQL generation. Furthermore, DBCopilot introduces a reverse schema-to-question generation paradigm that can automatically learn and adapt the router over massive databases without manual intervention. Experimental results verify that DBCopilot is a scalable and effective solution for schema-agnostic NL2SQL, providing a significant advance in handling natural language querying over massive databases for NLIDBs.
Paper Structure (30 sections, 3 equations, 10 figures, 7 tables, 2 algorithms)

This paper contains 30 sections, 3 equations, 10 figures, 7 tables, 2 algorithms.

Figures (10)

  • Figure 1: DBCopilot employs LLM-Copilot collaboration to scaling natural language querying over massive databases. Schema routing first navigates to the appropriate database and tables for NL questions via a copilot model. SQL generation then transforms SQL queries via LLMs with schema-aware prompts.
  • Figure 2: Training process of schema router. We first construct a schema graph to represent the relationships of all databases and their tables. Based on this graph, we sample SQL query schemata and generate pseduo-questions using a reverse-trained schema questioning model that synthesizes training data. Finally, we apply DFS serialization and train the schema router.
  • Figure 3: An illustration of pseudo-question generation.
  • Figure 4: An example of graph-based constrained decoding. The underline (_) denotes the separator between schema elements, and a red cross indicates that the generation of the token is invalid and disallowed. We leverage diverse beam search to ensure a more varied set of candidate sequences, represented in different colors in this graph.
  • Figure 5: An example of a basic NL2SQL prompt.
  • ...and 5 more figures

Theorems & Definitions (6)

  • Definition 1: Schema-Agnostic NL2SQL
  • Definition 2: Schema Routing
  • Definition 3: SQL Generation
  • Example 1
  • Example 2
  • Example 3