Assessing the quality of information extraction

Filip Seitl; Tomáš Kovářík; Soheyla Mirshahi; Jan Kryštůfek; Rastislav Dujava; Matúš Ondreička; Herbert Ullrich; Petr Gronat

TL;DR

The paper addresses how to evaluate information extraction (IE) quality when labeled data is scarce. It introduces a synthetic ground truth built from "needles" infused into the data, together with the MINEA (multiple infused needle extraction accuracy) score. It proposes a schema-driven approach to structuring extracted information and analyzes how length constraints and the lost-in-the-middle phenomenon affect IE, using iterative, piecewise extraction. The resulting evaluation framework is automatic and domain-adaptable, requires no manual labeling, and is used to compare LLMs on a healthcare business corpus. The paper also offers guidance on iteration strategies and model selection.

Abstract

Advances in large language models have notably enhanced the efficiency of information extraction from unstructured and semi-structured data sources. As these technologies become integral to various applications, establishing an objective measure of the quality of information extraction becomes imperative. However, the scarcity of labeled data makes this difficult. In this paper, we introduce an automatic framework to assess the quality and completeness of information extraction/retrieval. The framework focuses on information extracted in the form of entities and their properties. We discuss how to handle the input/output size limitations of large language models and analyze their performance when extracting information. In particular, we introduce scores to evaluate the quality of the extraction and provide an extensive discussion of how to interpret them.
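To make the needle idea concrete, here is a minimal sketch of how a synthetic ground truth could be infused into a text and scored. It is an illustration under stated assumptions, not the paper's implementation: the function names (infuse_needles, needle_accuracy, name_match), the needle fields ("name", "text"), and the exact-name matching rule are all hypothetical, and the score shown is a plain recovered-fraction that may differ from how MINEA is actually defined in the paper.

```python
import random


def infuse_needles(paragraphs, needles, seed=0):
    """Insert each needle's text at a random paragraph boundary (assumed strategy)."""
    rng = random.Random(seed)
    doc = list(paragraphs)
    for needle in needles:
        # randint is inclusive on both ends, so a needle may also land at the end.
        pos = rng.randint(0, len(doc))
        doc.insert(pos, needle["text"])
    return doc


def name_match(needle, entity):
    """Hypothetical matching rule: case-insensitive equality of the 'name' field."""
    return needle["name"].lower() == entity.get("name", "").lower()


def needle_accuracy(needles, extracted, match=name_match):
    """Fraction of infused needles recovered by the extraction (simplified score)."""
    if not needles:
        return 0.0
    found = sum(1 for n in needles if any(match(n, e) for e in extracted))
    return found / len(needles)


# Illustrative data only; none of it comes from the paper's corpus.
paragraphs = ["First paragraph.", "Second paragraph.", "Third paragraph."]
needles = [
    {"name": "Acme Clinic", "text": "Acme Clinic opened a new outpatient ward."},
    {"name": "Beta Lab", "text": "Beta Lab offers same-day blood tests."},
]
infused = infuse_needles(paragraphs, needles, seed=1)

# Stand-in for the entities an LLM-based extractor would return on the infused text.
extracted = [{"name": "acme clinic", "description": "outpatient ward"}]
score = needle_accuracy(needles, extracted)
```

Because the needles are known in advance, the score can be computed without any manual labeling; in practice the matching rule would likely need to tolerate paraphrased names and compare further properties.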

Paper Structure (15 sections, 1 equation, 8 figures, 6 tables)

Introduction
Related work
Capturing the structure
Schema
The role of LLMs
Length aspects
Length restrictions
Lost in the middle
Quality of extraction
Iterated LLM calls
Test the quality
Needles
Multiple infused needle extraction accuracy
Identification of needles
Model comparison

Figures (8)

Figure 1: Toy example: structured information encapsulating three entities using schema.org.
Figure 2: Toy example: two needles, highlighted by blue color, accompanied by additional information described by 'name', 'description', and 'keywords'.
Figure 3: Toy example: extracted information from the data infused by needles from Figure 2.
Figure 4: Prompt to determine a possible suitable schema from a given text -- Wikipedia article about IE.
Figure 5: Schema.org types found by an LLM within Wikipedia article about IE.
...and 3 more figures
