KVP10k: A Comprehensive Dataset for Key-Value Pair Extraction in Business Documents

Oshri Naparstek, Roi Pony, Inbar Shapira, Foad Abo Dahood, Ophir Azulai, Yevgeny Yaroker, Nadav Rubinstein, Maksym Lysak, Peter Staar, Ahmed Nassar, Nikolaos Livathinos, Christoph Auer, Elad Amrani, Idan Friedman, Orit Prince, Yevgeny Burshtein, Adi Raz Goldfarb, Udi Barzelay

TL;DR

The paper tackles the problem of extracting key-value pairs from business documents without relying on predefined keys, a gap left by existing KIE-focused datasets. It introduces KVP10k, a large-scale real-world dataset with 10,707 annotated pages and a 17-class annotation scheme, along with a two-task benchmark that combines KVP and KIE challenges. The authors provide data acquisition pipelines (Common Crawl and FCC), detailed annotation guidelines, and extensive statistics to showcase diversity and richness, plus an open-source benchmarking toolkit and baselines using LMDX-style generation with Mistral-7B. This work enables more robust information extraction in real-world documents and sets a foundation for future research on non-predetermined KVP extraction and cross-domain generalization.

Abstract

In recent years, the challenge of extracting information from business documents has emerged as a critical task, finding applications across numerous domains. This effort has attracted substantial interest from both industry and academia, highlighting its significance in the current technological landscape. Most datasets in this area are primarily focused on Key Information Extraction (KIE), where the extraction process revolves around extracting information using a specific, predefined set of keys. Unlike most existing datasets and benchmarks, our focus is on discovering key-value pairs (KVPs) without relying on predefined keys, navigating through an array of diverse templates and complex layouts. This task presents unique challenges, primarily due to the absence of comprehensive datasets and benchmarks tailored for non-predetermined KVP extraction. To address this gap, we introduce KVP10k, a new dataset and benchmark specifically designed for KVP extraction. The dataset contains 10,707 richly annotated images. In our benchmark, we also introduce a new challenging task that combines elements of KIE as well as KVP in a single task. KVP10k sets itself apart with its extensive diversity in data and richly detailed annotations, paving the way for advancements in the field of information extraction from complex business documents.

Paper Structure (16 sections, 10 figures, 1 table)

Figures (10)

  • Figure 1: Comparative overview of KVP10k versus other datasets: Comparing the Number of Documents, Entities, Keys, Values, and Links
  • Figure 2: A schematic describing the data collection process using web crawling.
  • Figure 3: Example of an annotated page
  • Figure 4: Exemplifying Versatility: A collage of diverse document categories from KVP10k Dataset
  • Figure 5: Distribution of entities per page in KVP10k.
  • ...and 5 more figures