TEXT2DB: Integration-Aware Information Extraction with Large Language Model Agents

Yizhu Jiao, Sha Li, Sizhe Zhou, Heng Ji, Jiawei Han

TL;DR

Text2DB reframes information extraction as integration-aware, aiming to enrich an existing database by extracting and normalizing values from documents under user instructions. It introduces OPAL, a three-agent framework (Observer, Planner, Analyzer) that uses tool-based IE models and code-based plans to adaptively update diverse database schemas. A new benchmark combining WikiTables and BIRD databases across data infilling, row population, and column addition assesses schema-adaptation and integration challenges, with detailed error analyses. The work demonstrates that dynamic planning and database-informed demonstrations significantly improve update quality, highlighting the importance of tool learning and verification in large-language-model-assisted data integration.

Abstract

The task of information extraction (IE) is to extract structured knowledge from text. However, it is often not straightforward to utilize IE output due to the mismatch between the IE ontology and the downstream application needs. We propose a new formulation of IE, TEXT2DB, that emphasizes the integration of IE output and the target database (or knowledge base). Given a user instruction, a document set, and a database, our task requires the model to update the database with values from the document set to satisfy the user instruction. This task requires understanding user instructions for what to extract and adapting to the given DB/KB schema for how to extract on the fly. To evaluate this new task, we introduce a new benchmark featuring common demands such as data infilling, row population, and column addition. In addition, we propose an LLM agent framework, OPAL (Observe-Plan-Analyze LLM), which includes an Observer component that interacts with the database, a Planner component that generates a code-based plan with calls to IE models, and an Analyzer component that provides feedback regarding code quality before execution. Experiments show that OPAL can successfully adapt to diverse database schemas by generating different code plans and calling the required IE models. We also highlight difficult cases such as dealing with large databases with complex dependencies and extraction hallucination, which we believe deserve further investigation. Source code: https://github.com/yzjiao/Text2DB
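
To make the three update types named in the abstract concrete, the following is a minimal sketch using Python's standard `sqlite3` module. It is not the authors' implementation: the Movie/Director schema is a hypothetical example modeled on the foreign-key setup described in the Figure 1 caption, the helper names (`infill`, `populate_row`, `add_column`) are invented for illustration, and the "extracted" values are hard-coded stand-ins for what an IE model would return.

```python
import sqlite3


def build_demo_db():
    """Create a hypothetical two-table database linked by a foreign key."""
    conn = sqlite3.connect(":memory:")
    conn.execute("PRAGMA foreign_keys = ON")  # enforce referential integrity
    conn.executescript("""
        CREATE TABLE Director (ID INTEGER PRIMARY KEY, Name TEXT NOT NULL);
        CREATE TABLE Movie (
            ID INTEGER PRIMARY KEY,
            Title TEXT NOT NULL,
            Year INTEGER,
            DirectorID INTEGER REFERENCES Director(ID)
        );
        INSERT INTO Director VALUES (1, 'Director A');
        INSERT INTO Movie VALUES (1, 'Movie X', NULL, 1);
    """)
    return conn


def infill(conn, title, year):
    """Data infilling: fill a missing cell in an existing row."""
    conn.execute(
        "UPDATE Movie SET Year = ? WHERE Title = ? AND Year IS NULL",
        (year, title),
    )


def populate_row(conn, title, year, director_name):
    """Row population: add a new row, reusing or creating the referenced row."""
    row = conn.execute(
        "SELECT ID FROM Director WHERE Name = ?", (director_name,)
    ).fetchone()
    if row is None:
        cur = conn.execute(
            "INSERT INTO Director (Name) VALUES (?)", (director_name,)
        )
        director_id = cur.lastrowid
    else:
        director_id = row[0]
    conn.execute(
        "INSERT INTO Movie (Title, Year, DirectorID) VALUES (?, ?, ?)",
        (title, year, director_id),
    )


def add_column(conn, column, values_by_title):
    """Column addition: extend the schema, then fill the new column."""
    # Column names cannot be bound as SQL parameters, so validate first.
    if not column.isidentifier():
        raise ValueError(f"unsafe column name: {column!r}")
    conn.execute(f'ALTER TABLE Movie ADD COLUMN "{column}" TEXT')
    for title, value in values_by_title.items():
        conn.execute(
            f'UPDATE Movie SET "{column}" = ? WHERE Title = ?',
            (value, title),
        )


if __name__ == "__main__":
    conn = build_demo_db()
    infill(conn, "Movie X", 1999)                        # stand-in value
    populate_row(conn, "Movie Y", 2005, "Director B")    # stand-in values
    add_column(conn, "Genre", {"Movie X": "Drama"})      # stand-in value
    for record in conn.execute("SELECT * FROM Movie ORDER BY ID"):
        print(record)
```

In this sketch, row population resolves the director name to an existing key before inserting, which is one simple way to keep the foreign-key constraint satisfied; with `PRAGMA foreign_keys = ON`, an insert that references a nonexistent director is rejected by SQLite rather than silently stored.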

Paper Structure

This paper contains 25 sections, 22 figures, 6 tables.

Figures (22)

  • Figure 1: Our Text2DB task is defined over a database, a user instruction, and a document set. The model aims to fulfill the user instruction by updating the database with values (shown in yellow) extracted from text. In this example, the input database has two tables linked with the foreign key constraint (DirectorID in the Movie table refers to ID of the Director table).
  • Figure 2: Three major challenges of the Text2DB task: (1) dynamically decide what to extract by analyzing complex database schemas and interpreting user instructions; (2) resolve extraction ambiguity to ensure extracted values match the semantics and granularity of existing database content; (3) integrate the extracted data into the database while maintaining integrity and consistency.
  • Figure 3: Example of the generated code in one pass, which misconfigures the attribute extraction tool since the tool expects a list of attributes rather than a string.
  • Figure 4: Framework architecture.
  • Figure 5: Error distribution of the whole framework.
  • ...and 17 more figures