Table of Contents
Fetching ...

UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification

Poojah Ganesan, Rajat Aayush Jha, Dan Roth, Vivek Gupta

TL;DR

UNJOIN addresses the challenge of multi-table Text-to-SQL by decoupling schema element retrieval from SQL logic through a two-stage framework. It first flattens a multi-table schema into a single-table form (Schema Simplification), then generates an intermediate simplified SQL and translates it back to executable SQL on the original schema (Query Translation). The approach relies solely on schema information, avoiding data access or fine-tuning, and demonstrates strong performance on Spider and BIRD across closed-book and open-book settings, including end-to-end table QA scenarios. By reducing compounding errors via decomposition and providing a practical mapping back to the original schema, UNJOIN improves table/column retrieval, JOIN/UNION construction, and generalization across diverse schemas. The results suggest it is a scalable, plug-in module that enhances multi-table Text-to-SQL in real-world database contexts.

Abstract

Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. But, it remains challenging in multi-table databases due to complex schema and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information, which does not require data access or fine-tuning, making it scalable and adaptable across databases.

UNJOIN: Enhancing Multi-Table Text-to-SQL Generation via Schema Simplification

TL;DR

UNJOIN addresses the challenge of multi-table Text-to-SQL by decoupling schema element retrieval from SQL logic through a two-stage framework. It first flattens a multi-table schema into a single-table form (Schema Simplification), then generates an intermediate simplified SQL and translates it back to executable SQL on the original schema (Query Translation). The approach relies solely on schema information, avoiding data access or fine-tuning, and demonstrates strong performance on Spider and BIRD across closed-book and open-book settings, including end-to-end table QA scenarios. By reducing compounding errors via decomposition and providing a practical mapping back to the original schema, UNJOIN improves table/column retrieval, JOIN/UNION construction, and generalization across diverse schemas. The results suggest it is a scalable, plug-in module that enhances multi-table Text-to-SQL in real-world database contexts.

Abstract

Recent advances in large language models (LLMs) have greatly improved Text-to-SQL performance for single-table queries. But, it remains challenging in multi-table databases due to complex schema and relational operations. Existing methods often struggle with retrieving the right tables and columns, generating accurate JOINs and UNIONs, and generalizing across diverse schemas. To address these issues, we introduce UNJOIN, a two-stage framework that decouples the retrieval of schema elements from SQL logic generation. In the first stage, we merge the column names of all tables in the database into a single-table representation by prefixing each column with its table name. This allows the model to focus purely on accurate retrieval without being distracted by the need to write complex SQL logic. In the second stage, the SQL query is generated on this simplified schema and mapped back to the original schema by reconstructing JOINs, UNIONs, and relational logic. Evaluations on SPIDER and BIRD datasets show that UNJOIN matches or exceeds the state-of-the-art baselines. UNJOIN uses only schema information, which does not require data access or fine-tuning, making it scalable and adaptable across databases.

Paper Structure

This paper contains 43 sections, 4 figures, 11 tables, 1 algorithm.

Figures (4)

  • Figure 1: Baseline vs UnJoin
  • Figure 2: Our Proposed Method. After schema conversion, the resulting single-table schema contains a total of 54 columns obtained by combining the columns from all shown tables: order, account, district, disp, card, client, loan, and trans. Due to space constraints, the complete single-table representation is not shown.
  • Figure 3: Retrieval accuracy Performance with Increasing Number of Tables
  • Figure 4: Schema Simplification