SQUiD: Synthesizing Relational Databases from Unstructured Text

Mushtari Sadia, Zhenning Yang, Yunming Xiao, Ang Chen, Amrita Roy Chowdhury

TL;DR

This paper defines Text2R, the task of synthesizing a relational database from unstructured text, and introduces SQUiD, a neurosymbolic four-stage framework that partitions the problem into schema generation, value identification, table population, and database materialization. By combining symbolic information extraction with LLM-guided methods and programmatic SQL generation, SQUiD produces schema-consistent databases, which are materialized in SQLite for deterministic evaluation. The authors construct automated benchmarks (BIRD and Kaggle-based datasets) and propose a metric suite that evaluates schema and data fidelity, reporting consistent improvements over zero-shot baselines across diverse domains and model sizes. The work advances end-to-end data integration from narrative text, with practical implications for scalable data curation and cross-domain information management.

Abstract

Relational databases are central to modern data management, yet most data exists in unstructured forms like text documents. To bridge this gap, we leverage large language models (LLMs) to automatically synthesize a relational database by generating its schema and populating its tables from raw text. We introduce SQUiD, a novel neurosymbolic framework that decomposes this task into four stages, each with specialized techniques. Our experiments show that SQUiD consistently outperforms baselines across diverse datasets.
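
To make the last of the four stages concrete, the following is a minimal sketch of programmatic database materialization in SQLite. It is an illustration under stated assumptions, not the authors' implementation: the `schema` and `rows` dictionaries, the helper names `create_table_sql` and `materialize`, and the sample values are all hypothetical stand-ins for whatever the earlier stages would emit.

```python
import sqlite3

# Hypothetical output of the earlier stages (assumed format, not from the paper):
# a schema description plus the tuples aligned to it.
schema = {
    "Traveler": {
        "columns": {"traveler_id": "INTEGER", "name": "TEXT"},
        "primary_key": "traveler_id",
        "foreign_keys": {},
    },
    "Trip": {
        "columns": {"trip_id": "INTEGER", "traveler_id": "INTEGER", "destination": "TEXT"},
        "primary_key": "trip_id",
        "foreign_keys": {"traveler_id": ("Traveler", "traveler_id")},
    },
}
rows = {
    "Traveler": [(1, "Alice")],
    "Trip": [(10, 1, "Lisbon")],
}


def create_table_sql(table, spec):
    """Build one CREATE TABLE statement from a table description."""
    parts = []
    for col, typ in spec["columns"].items():
        suffix = " PRIMARY KEY" if col == spec["primary_key"] else ""
        parts.append(f'"{col}" {typ}{suffix}')
    for col, (ref_table, ref_col) in spec["foreign_keys"].items():
        parts.append(f'FOREIGN KEY ("{col}") REFERENCES "{ref_table}" ("{ref_col}")')
    return f'CREATE TABLE "{table}" ({", ".join(parts)})'


def materialize(schema, rows):
    """Create the tables and insert the tuples into an in-memory SQLite database."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # SQLite does not enforce FKs by default
    for table, spec in schema.items():
        conn.execute(create_table_sql(table, spec))
    for table, tuples in rows.items():
        placeholders = ", ".join("?" for _ in schema[table]["columns"])
        conn.executemany(f'INSERT INTO "{table}" VALUES ({placeholders})', tuples)
    conn.commit()
    return conn


conn = materialize(schema, rows)
joined = conn.execute(
    'SELECT t.name, p.destination FROM "Traveler" t '
    'JOIN "Trip" p ON p.traveler_id = t.traveler_id'
).fetchall()
print(joined)
```

Generating the SQL from a structured description, rather than asking a model to write SQL text directly, keeps this step deterministic: the same schema and tuples always yield the same database, and constraint violations surface as SQLite errors.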

Paper Structure

This paper contains 31 sections, 11 equations, 21 figures, 8 tables.

Figures (21)

  • Figure 1: Challenges of synthesizing relational DB from text
  • Figure 2: Overview of SQUiD. (1) Schema Generation constructs a relational schema that defines the tables, columns, and their relationships, from the entities in the text. (2) Value Identification extracts relevant values (e.g., names, dates) from the text. These values are then organized during (3) Table Population by aligning them with the generated schema to form tuples. (4) Database Materialization programmatically translates the output into SQL statements, producing the final relational database.
  • Figure 3: The closest related works, T3 (Deng et al., 2024), StructSum (Jain et al., 2024), and Evaporate (Arora et al.), when applied to our example dataset, either produced a single table with incorrect column-value assignments or multiple disconnected, irrelevant tables. In contrast, as shown in Fig. \ref{fig:e2e}, SQUiD correctly generates all five tables corresponding to the entities (Traveler, Trip, Accommodation, Transportation and Destination) along with their proper relationships.
  • Figure 4: Examples of valid versus invalid relational schemas. PK: Primary key; FK: Foreign key.
  • Figure 5: Our dataset generation process
  • ...and 16 more figures
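
The caption of Figure 4 contrasts valid and invalid relational schemas in terms of primary and foreign keys. The sketch below shows one way such a check could look. The validity rules used here (every table declares a primary key among its columns, and every foreign key points at the primary key of an existing table) are an assumption for illustration, as are the schema format and the function name `schema_problems`; the paper's actual criteria may differ.

```python
def schema_problems(schema):
    """Return a list of human-readable problems; an empty list means the schema passes.

    Assumed rules (not taken from the paper):
      1. each table has a primary key that is one of its columns;
      2. each foreign key column is declared and references the primary key
         of a table that exists in the schema.
    """
    problems = []
    for table, spec in schema.items():
        pk = spec.get("primary_key")
        if pk is None or pk not in spec["columns"]:
            problems.append(f"{table}: missing primary key")
        for col, (ref_table, ref_col) in spec.get("foreign_keys", {}).items():
            if col not in spec["columns"]:
                problems.append(f"{table}.{col}: foreign key column not declared")
            if ref_table not in schema:
                problems.append(f"{table}.{col}: references unknown table {ref_table}")
            elif schema[ref_table].get("primary_key") != ref_col:
                problems.append(
                    f"{table}.{col}: does not reference the primary key of {ref_table}"
                )
    return problems


# Hypothetical schemas built from two of the entities named in Figure 3.
valid = {
    "Traveler": {"columns": ["traveler_id", "name"], "primary_key": "traveler_id",
                 "foreign_keys": {}},
    "Trip": {"columns": ["trip_id", "traveler_id"], "primary_key": "trip_id",
             "foreign_keys": {"traveler_id": ("Traveler", "traveler_id")}},
}
invalid = {
    # Traveler has no primary key, so Trip's foreign key has nothing valid to target.
    "Traveler": {"columns": ["name"], "primary_key": None, "foreign_keys": {}},
    "Trip": {"columns": ["trip_id", "traveler_id"], "primary_key": "trip_id",
             "foreign_keys": {"traveler_id": ("Traveler", "traveler_id")}},
}

print(schema_problems(valid))
print(schema_problems(invalid))
```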