Learning to Extract Structured Entities Using Language Models

Haolun Wu, Ye Yuan, Liana Mikaelyan, Alexander Meulemans, Xue Liu, James Hensman, Bhaskar Mitra

TL;DR

A new Multistage Structured Entity Extraction (MuSEE) model is introduced that harnesses the power of LMs for enhanced effectiveness and efficiency by decomposing the extraction task into multiple stages, offering promising directions for future advancements in structured entity extraction.

Abstract

Recent advances in machine learning have significantly impacted the field of information extraction, with Language Models (LMs) playing a pivotal role in extracting structured information from unstructured text. Prior work typically frames information extraction as triplet-centric and evaluates it with classical metrics such as precision and recall. We reformulate the task to be entity-centric, enabling the use of diverse metrics that can provide more insights from various perspectives. We contribute to the field by introducing Structured Entity Extraction and proposing the Approximate Entity Set OverlaP (AESOP) metric, designed to appropriately assess model performance. We then introduce a new Multistage Structured Entity Extraction (MuSEE) model that harnesses the power of LMs for enhanced effectiveness and efficiency by decomposing the extraction task into multiple stages. Quantitative and human side-by-side evaluations confirm that our model outperforms baselines, offering promising directions for future advancements in structured entity extraction. Our source code and datasets are available at https://github.com/microsoft/Structured-Entity-Extraction.

Paper Structure

This paper contains 34 sections, 5 equations, 11 figures, and 4 tables.

Figures (11)

  • Figure 1: Illustration of structured entity extraction, an entity-centric formulation of information extraction. Given a text description and a predefined schema containing all candidate entity types and property keys, we aim to output structured JSON for all entities in the text along with their information.
  • Figure 2: The pipeline of our proposed MuSEE model, which is built on an encoder-decoder architecture. The input text is encoded only once, and the decoder is shared across all three stages. All predictions within each stage can be processed in a batch, and teacher forcing enables parallelization even across stages during training.
  • Figure 3: An overall effectiveness-and-efficiency comparison across models on the Wikidata-based dataset. MuSEE strongly outperforms all baselines on both measures. Effectiveness is measured by AESOP.
  • Figure 4: Grounding check across models on the Wikidata-based dataset. MuSEE shows the smallest performance drop on the perturbed version of the data compared with the baselines.
  • Figure 5: An illustration of the AESOP metric, including optimal entity assignment (phase 1), pairwise entity comparison (phase 2), and overall metric computation with various similarity and normalization choices.
  • ...and 6 more figures
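
The two phases named in the Figure 5 caption (optimal entity assignment, then pairwise entity comparison) can be illustrated with a small sketch. This is not the paper's implementation. It assumes entities are flat dictionaries of string properties, uses token-level Jaccard as the property similarity, finds the assignment by brute force, and normalizes by the larger set size. These are single illustrative choices among the "various similarity and normalization choices" the caption mentions. The function names and the example entities are hypothetical.

```python
from itertools import permutations


def jaccard(a: str, b: str) -> float:
    """Token-level Jaccard similarity between two property values (assumed choice)."""
    ta, tb = set(a.lower().split()), set(b.lower().split())
    if not ta and not tb:
        return 1.0
    return len(ta & tb) / len(ta | tb)


def entity_similarity(pred: dict, target: dict) -> float:
    """Pairwise entity comparison: average property similarity over the union
    of keys. A key present on only one side contributes 0."""
    keys = set(pred) | set(target)
    if not keys:
        return 1.0
    shared = (k for k in keys if k in pred and k in target)
    return sum(jaccard(pred[k], target[k]) for k in shared) / len(keys)


def set_overlap(preds: list, targets: list) -> float:
    """Approximate set overlap between predicted and target entities.

    Phase 1: pick the one-to-one assignment that maximizes total similarity
             (brute force here; fine only for a handful of entities).
    Phase 2: sum the matched similarities and divide by the larger set size,
             so missing and spurious entities both lower the score.
    """
    if not preds and not targets:
        return 1.0
    small, large = sorted((preds, targets), key=len)
    best = 0.0
    for perm in permutations(range(len(large)), len(small)):
        score = sum(entity_similarity(small[i], large[j]) for i, j in enumerate(perm))
        best = max(best, score)
    return best / len(large)


# Hypothetical target and predicted entities in the structured-JSON style.
targets = [
    {"type": "human", "name": "Bill Gates", "occupation": "entrepreneur"},
    {"type": "organization", "name": "Microsoft"},
]
preds = [
    {"type": "organization", "name": "Microsoft"},
    {"type": "human", "name": "Bill Gates"},  # the occupation property is missing
]

score = set_overlap(preds, targets)
print(round(score, 4))
```

Because the assignment is solved before scoring, the order in which a model emits entities does not affect the result. For larger entity sets, the brute-force search could be replaced by a polynomial-time assignment solver such as the Hungarian algorithm.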