Table of Contents
Fetching ...

Text2Schema: Filling the Gap in Designing Database Table Structures based on Natural Language

Qin Wang, Youhuan Li, Yansong Feng, Si Chen, Ziming Li, Pan Zhang, Zihui Si, Yixuan Chen, Zhichao Shi, Zebin Huang, Guo Chen, Wenqiang Jin

TL;DR

This work defines Text2Schema, a task that automatically generates a database schema from natural language requirements, addressing a gap where Text2SQL assumes pre-designed schemas. It introduces SchemaAgent, an LLM-based multi-agent framework with six specialized roles and a group chat communication mechanism that includes an error-detection and correction loop to minimize cascading mistakes. A formal database schema definition is proposed, and the RSchema benchmark with 381 requirement–schema pairs is released to support evaluation. Experimental results show SchemaAgent outperforms several baselines across multiple metrics, showcasing its potential to empower non-technical users to produce robust, 3NF database designs and enabling downstream Text2SQL usage for data manipulation.

Abstract

People without a database background usually rely on file systems or tools such as Excel for data management, which often lead to redundancy and data inconsistency. Relational databases possess strong data management capabilities, but require a high level of professional expertise from users. Although there are already many works on Text2SQL to automate the translation of natural language into SQL queries for data manipulation, all of them presuppose that the database schema is pre-designed. In practice, schema design itself demands domain expertise, and research on directly generating schemas from textual requirements remains unexplored. In this paper, we systematically define a new problem, called Text2Schema, to convert a natural language text requirement into a relational database schema. With an effective Text2Schema technique, users can effortlessly create database table structures using natural language, and subsequently leverage existing Text2SQL techniques to perform data manipulations, which significantly narrows the gap between non-technical personnel and highly efficient, versatile relational database systems. We propose SchemaAgent, an LLM-based multi-agent framework for Text2Schema. We emulate the workflow of manual schema design by assigning specialized roles to agents and enabling effective collaboration to refine their respective subtasks. We also incorporate dedicated roles for reflection and inspection, along with an innovative error detection and correction mechanism to identify and rectify issues across various phases. Moreover, we build and open source a benchmark containing 381 pairs of requirement description and schema. Experimental results demonstrate the superiority of our approach over comparative work.

Text2Schema: Filling the Gap in Designing Database Table Structures based on Natural Language

TL;DR

This work defines Text2Schema, a task that automatically generates a database schema from natural language requirements, addressing a gap where Text2SQL assumes pre-designed schemas. It introduces SchemaAgent, an LLM-based multi-agent framework with six specialized roles and a group chat communication mechanism that includes an error-detection and correction loop to minimize cascading mistakes. A formal database schema definition is proposed, and the RSchema benchmark with 381 requirement–schema pairs is released to support evaluation. Experimental results show SchemaAgent outperforms several baselines across multiple metrics, showcasing its potential to empower non-technical users to produce robust, 3NF database designs and enabling downstream Text2SQL usage for data manipulation.

Abstract

People without a database background usually rely on file systems or tools such as Excel for data management, which often lead to redundancy and data inconsistency. Relational databases possess strong data management capabilities, but require a high level of professional expertise from users. Although there are already many works on Text2SQL to automate the translation of natural language into SQL queries for data manipulation, all of them presuppose that the database schema is pre-designed. In practice, schema design itself demands domain expertise, and research on directly generating schemas from textual requirements remains unexplored. In this paper, we systematically define a new problem, called Text2Schema, to convert a natural language text requirement into a relational database schema. With an effective Text2Schema technique, users can effortlessly create database table structures using natural language, and subsequently leverage existing Text2SQL techniques to perform data manipulations, which significantly narrows the gap between non-technical personnel and highly efficient, versatile relational database systems. We propose SchemaAgent, an LLM-based multi-agent framework for Text2Schema. We emulate the workflow of manual schema design by assigning specialized roles to agents and enabling effective collaboration to refine their respective subtasks. We also incorporate dedicated roles for reflection and inspection, along with an innovative error detection and correction mechanism to identify and rectify issues across various phases. Moreover, we build and open source a benchmark containing 381 pairs of requirement description and schema. Experimental results demonstrate the superiority of our approach over comparative work.

Paper Structure

This paper contains 31 sections, 3 equations, 11 figures, 3 tables.

Figures (11)

  • Figure 1: Interactions between ordinary users and databases.
  • Figure 2: The process for database schema design. The 1st phase is requirements analysis. The 2nd is conceptual design phase for an Entity-Relationship (ER) model. The 3rd is logical design phase mapping ER model into logical schema.
  • Figure 3: Structural overview of our RSchema database. The central area illustrates the distribution of domains. The left panel presents an example schema derived from a specific domain. The upper right panel depicts the distribution of the number of tables per schema, while the lower right panel shows the distribution of the number of columns per table, providing insights into the schema complexity and granularity.
  • Figure 4: Our SchemaAgent framework uses LLM-based agents with six distinct roles, including Project manager, Conceptual model designer, Conceptual model reviewer, Logical model designer, QA engineer, and Test executor. They collaboratively handle the sub-tasks involved in designing the database schema. The red arrow represents the process of error detection and correction.
  • Figure 5: The error feedback process of the logical model designer in our group chat communication mechanism.
  • ...and 6 more figures