RealKIE: Five Novel Datasets for Enterprise Key Information Extraction

Benjamin Townsend, Madison May, Katherine Mackowiak, Christopher Wells

TL;DR

RealKIE presents five enterprise-focused, document-level key information extraction (KIE) datasets (SEC S1 filings, US non-disclosure agreements, UK charity reports, FCC invoices, and resource contracts) that retain real-world OCR artifacts, sparse annotations, and complex layouts. It provides OCR outputs, labeled spans, preprocessing pipelines, and baseline results for four transformer models, illustrating the practical challenges of long-document KIE. The work analyzes OCR quality, layout complexity, annotation sparsity, and data-type variety, and positions RealKIE among existing benchmarks as a more realistic, industry-relevant testbed. By releasing the data, OCR outputs, and baselines, RealKIE aims to accelerate the development of robust, scalable information extraction methods for real-world enterprise problems.

Abstract

We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets include a diverse range of documents including SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction tasks like investment analysis and contract analysis. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data, OCR outputs, and code to reproduce baselines are available to download at https://indicodatasolutions.github.io/RealKIE/.

Paper Structure (24 sections, 2 figures, 26 tables)

This paper contains 24 sections, 2 figures, and 26 tables.

Figures (2)

  • Figure 1: This snippet of an FCC invoice is an example of reading order ambiguity and character recognition ambiguity. There are many equally correct ways to serialize this content. This characteristic is referred to as inherent reading order ambiguity. The bottom lines illustrate lower OCR confidences, indicating character recognition ambiguity. We can see that processes applied to this document, likely being printed and then scanned, have introduced some corruption of letters with "PRICE" reading as "PRICB" and "SCHEDULE" as "SCNEOULE".
  • Figure 2: Part of a table from the FCC Invoices dataset. In the paper's layout-and-quality table (label tab:layout-and-quality), this would simply show as a table. However, it contains features that significantly increase modeling difficulty compared to a typical table structure. For example, the slots per day indicator "22222--" is directly under the Air Time header but does not relate to it. Similarly, the date range values of the outer table are merged left across another labeled "Day" header. These complications vary significantly between different broadcasters.
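
The two kinds of ambiguity described in the Figure 1 caption can be illustrated with a small sketch. This is not the paper's preprocessing code: the Token class, the coordinates, the confidence values, and the 0.8 threshold are all hypothetical, and the tokens only loosely borrow the two corrupted words quoted in the caption. The sketch shows that the same four tokens serialize differently depending on the sort order chosen, and that a confidence threshold can flag tokens likely to contain character recognition errors.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Token:
    """A single OCR token (hypothetical structure, for illustration only)."""
    text: str
    x: float           # left edge of the bounding box
    y: float           # top edge of the bounding box
    confidence: float  # assumed OCR confidence in [0, 1]


# Hypothetical 2x2 grid of tokens, such as two headers with a value under each.
TOKENS = [
    Token("PRICB", x=0.0, y=0.0, confidence=0.62),
    Token("SCNEOULE", x=100.0, y=0.0, confidence=0.55),
    Token("100.00", x=0.0, y=20.0, confidence=0.97),
    Token("Weekly", x=100.0, y=20.0, confidence=0.95),
]


def serialize_row_major(tokens):
    """Read left to right within a line, then move down to the next line."""
    ordered = sorted(tokens, key=lambda t: (t.y, t.x))
    return " ".join(t.text for t in ordered)


def serialize_column_major(tokens):
    """Read top to bottom within a column, then move right to the next column."""
    ordered = sorted(tokens, key=lambda t: (t.x, t.y))
    return " ".join(t.text for t in ordered)


def low_confidence(tokens, threshold=0.8):
    """Return the text of tokens whose confidence falls below the threshold."""
    return [t.text for t in tokens if t.confidence < threshold]


if __name__ == "__main__":
    print(serialize_row_major(TOKENS))
    print(serialize_column_major(TOKENS))
    print(low_confidence(TOKENS))
```

Both serializations are defensible readings of the same layout, which is why a model trained on one ordering may struggle when an OCR engine emits the other.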