KIEval: Evaluation Metric for Document Key Information Extraction

Minsoo Khang, Sang Chul Jung, Sungrae Park, Teakgyu Hong

TL;DR

This work addresses the mismatch between industrial needs for Document KIE and existing evaluation metrics, which ignore entity grouping and correction costs. It introduces KIEval, an evaluation framework that uses group matching between predictions and ground truth to score models at two levels: KIEval Entity F1 and KIEval Group F1. A third score, $\text{KIEval}_{\text{Aligned}} = \frac{TP^{\text{entity}}}{TP^{\text{entity}} + \text{Error}}$, expresses errors as substitution, addition, and deletion costs so that it aligns with real-world correction costs. The approach is validated on SROIE, CORD, and FUNSD with diverse model families (LayoutXLM, LayoutLMv3, Donut) and with zero-shot LLMs. It also shows how KIEval supports the trade-off between automation rate and accuracy through threshold-based post-processing in RPA workflows.
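To make the difference between F1 and the aligned score concrete, the following minimal sketch compares them on three hypothetical scenarios (one missing entity, one wrong entity, one unexpected entity). It rests on assumptions that the summary above does not spell out: that one substitution repairs one false positive together with one false negative, that the remaining false negatives are additions and the remaining false positives are deletions, and that $\text{Error}$ is the sum of the three. The names `Counts`, `correction_error`, and `kieval_aligned` are illustrative and are not taken from the authors' code.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Counts:
    tp: int  # entities extracted correctly
    fp: int  # predicted entities that are not in the ground truth
    fn: int  # ground-truth entities missing from the prediction


def f1(c: Counts) -> float:
    """Standard F1, which counts FP and FN separately."""
    denom = 2 * c.tp + c.fp + c.fn
    return 2 * c.tp / denom if denom else 0.0


def correction_error(c: Counts) -> int:
    """Number of edits needed to fix the prediction.

    Assumption: one substitution fixes one FP and one FN at once;
    leftover FNs are additions and leftover FPs are deletions.
    """
    substitution = min(c.fp, c.fn)
    addition = c.fn - substitution
    deletion = c.fp - substitution
    return substitution + addition + deletion


def kieval_aligned(c: Counts) -> float:
    """TP / (TP + Error), following the formula quoted in the TL;DR."""
    denom = c.tp + correction_error(c)
    return c.tp / denom if denom else 0.0


# Hypothetical counts, chosen so each scenario needs exactly one correction.
scenarios = {
    "missing": Counts(tp=3, fp=0, fn=1),
    "wrong": Counts(tp=3, fp=1, fn=1),
    "unexpected": Counts(tp=3, fp=1, fn=0),
}

for name, counts in scenarios.items():
    print(f"{name:<12} F1={f1(counts):.3f}  aligned={kieval_aligned(counts):.3f}")
```

With these hypothetical counts, each scenario needs a single correction and the aligned score is 0.75 in all three, whereas F1 is roughly 0.857, 0.75, and 0.857. Under the stated assumptions, the aligned score therefore tracks the number of corrections rather than the separate FP and FN counts.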

Abstract

Document Key Information Extraction (KIE) is a technology that transforms valuable information in document images into structured data, and it has become an essential function in industrial settings. However, current evaluation metrics for this technology do not accurately reflect the critical attributes of its industrial applications. In this paper, we present KIEval, a novel application-centric evaluation metric for Document KIE models. Unlike prior metrics, KIEval assesses Document KIE models not just on the extraction of individual information (entity) but also on the extraction of structured information (grouping). Evaluating structured information provides an assessment of Document KIE models that better reflects the extraction of grouped information from documents in industrial settings. Designed with industrial application in mind, we believe that KIEval can become a standard evaluation metric for developing or applying Document KIE models in practice. The code will be publicly available.

Paper Structure

This paper contains 23 sections, 10 equations, 7 figures, 4 tables.

Figures (7)

  • Figure 1: Example from the CORD dataset (receipts). The dataset has non-grouped and grouped entities (non-grouped entities form a special group) and requires structured predictions, including Menu groups with Menu.name and Menu.price. Errors in model predictions are not limited to individual key-value pairs; they also occur in the extraction of structural relations between entities (marked in red). Both error types must be considered in Document KIE model evaluation.
  • Figure 2: Comparison of evaluation metrics for KIE tasks. The ground truth and predictions correspond to the document image in Fig. 1. The red boxes indicate the errors accounted for by the respective metrics during evaluation. Unlike Entity-level F1, which does not account for structural relations, the proposed KIEval metric performs both entity-level and group-level evaluation based on the group-matching information (blue links).
  • Figure 3: F1 score examples in three different scenarios. From the application perspective, the three scenarios require the same number of error corrections: (#1) filling in missing information, (#2) replacing wrong information, and (#3) deleting unexpected information. F1, however, counts false negatives (FN) and false positives (FP) separately when computing the score.
  • Figure 4: Sample images from the SROIE (left), CORD (center), and FUNSD (right) datasets.
  • Figure 5: Examples illustrating the difference between Entity F1 and KIEval. The top scenario is constructed to showcase metric disparities, whereas the bottom scenario is based on a real prediction result from the Donut model.
  • ...and 2 more figures