RealKIE: Five Novel Datasets for Enterprise Key Information Extraction
Benjamin Townsend, Madison May, Katherine Mackowiak, Christopher Wells
TL;DR
RealKIE presents five enterprise-focused, document-level key information extraction (KIE) datasets (SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts) that incorporate real-world OCR artifacts, sparse annotations, and complex layouts. It provides OCR outputs, labeled spans, preprocessing pipelines, and baseline results for four transformer models, illustrating the practical challenges of long-document KIE. The work analyzes OCR quality, layout complexity, annotation sparsity, and data-type variety, and positions RealKIE among existing benchmarks as a more realistic, industry-relevant testbed. By releasing the data, OCR outputs, and baselines, RealKIE aims to accelerate the development of robust, scalable information extraction methods for real-world enterprise problems.
Abstract
We introduce RealKIE, a benchmark of five challenging datasets aimed at advancing key information extraction methods, with an emphasis on enterprise applications. The datasets cover a diverse range of documents: SEC S1 Filings, US Non-disclosure Agreements, UK Charity Reports, FCC Invoices, and Resource Contracts. Each presents unique challenges: poor text serialization, sparse annotations in long documents, and complex tabular layouts. These datasets provide a realistic testing ground for key information extraction in applications such as investment analysis and contract analysis. In addition to presenting these datasets, we offer an in-depth description of the annotation process, document processing techniques, and baseline modeling approaches. This contribution facilitates the development of NLP models capable of handling practical challenges and supports further research into information extraction technologies applicable to industry-specific problems. The annotated data, OCR outputs, and code to reproduce baselines are available for download at https://indicodatasolutions.github.io/RealKIE/.
