UNIKIE-BENCH: Benchmarking Large Multimodal Models for Key Information Extraction in Visual Documents

Yifan Ji, Zhipeng Xu, Zhenghao Liu, Zulong Chen, Qian Zhang, Zhibo Yang, Junyang Lin, Yu Gu, Ge Yu, Maosong Sun

TL;DR

UniKIE-Bench tackles the challenge of robust key information extraction (KIE) from visually rich documents by introducing a schema-guided, end-to-end benchmark for large multimodal models. It formalizes KIE with two formulations, $y^{QA} = \{ f: y_f \}_{f \in \mathcal{F}}$ and $\mathbf{y}^{SG} = \mathcal{M}(x,s)$, and implements two tracks: constrained-category (scenario-based schemas) and open-category (document-level schemas), enabling comprehensive evaluation across diverse document types and information demands. The benchmark assembles 6,133 documents across 3 domains and 11 scenarios for constrained KIE and generates multilingual, synthetically augmented open-category data with realistic noise, ground-truth schemas, and OCR-corrected labels to enable end-to-end LMM assessment. Evaluations on 15 state-of-the-art LMMs reveal substantial degradation under varied schemas, long-tail fields, and complex layouts, with notable disparities between closed- and open-source models and across languages, underscoring the need for grounded, layout-aware reasoning and faithful extraction. The work provides a unified evaluation protocol, release-ready data and metrics, and a public codebase, aiming to drive the development of more robust and trustworthy document understanding systems.
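
The two formulations named above can be contrasted with a minimal sketch. Everything in it is an illustrative assumption rather than the benchmark's code: the document is a plain string standing in for a document image, and `toy_parse`, `toy_ask`, and `toy_model` are hypothetical stand-ins for an LMM. The QA-style function issues one query per field $f \in \mathcal{F}$ and collects the answers, while the schema-guided function makes a single call $\mathcal{M}(x, s)$ and aligns the result to the schema.

```python
from typing import Callable, Dict, List

Document = str  # assumption: a text stand-in for a document image


def extract_qa(doc: Document, fields: List[str],
               ask: Callable[[Document, str], str]) -> Dict[str, str]:
    """QA-style KIE: one query per field f, collected as {f: y_f}."""
    return {f: ask(doc, f) for f in fields}


def extract_schema_guided(doc: Document, schema: List[str],
                          model: Callable[[Document, List[str]], Dict[str, str]]
                          ) -> Dict[str, str]:
    """Schema-guided KIE: a single call y = M(x, s), aligned to the schema."""
    raw = model(doc, schema)
    # Keep only schema fields; fields the model did not return become "".
    return {f: raw.get(f, "") for f in schema}


# --- Hypothetical toy "model": parses "Key: value" lines from text ---
def toy_parse(doc: Document) -> Dict[str, str]:
    pairs: Dict[str, str] = {}
    for line in doc.splitlines():
        if ":" in line:
            key, value = line.split(":", 1)
            pairs[key.strip().lower()] = value.strip()
    return pairs


def toy_ask(doc: Document, field: str) -> str:
    return toy_parse(doc).get(field, "")


def toy_model(doc: Document, schema: List[str]) -> Dict[str, str]:
    parsed = toy_parse(doc)
    return {f: parsed[f] for f in schema if f in parsed}


doc = "Vendor: ACME Ltd\nDate: 2024-01-05\nTotal: 42.00"
schema = ["vendor", "total", "tax"]  # "tax" is absent from the document

y_qa = extract_qa(doc, schema, toy_ask)
y_sg = extract_schema_guided(doc, schema, toy_model)
print(y_qa)
print(y_sg)
```

In this toy setting both routes return the same dictionary; the practical difference is that the QA-style route costs one model call per field, whereas the schema-guided route asks for the whole structured output at once.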

Abstract

Key Information Extraction (KIE) from real-world documents remains challenging due to substantial variations in layout structures, visual quality, and task-specific information requirements. Recent Large Multimodal Models (LMMs) have shown promising potential for performing end-to-end KIE directly from document images. To enable a comprehensive and systematic evaluation across realistic and diverse application scenarios, we introduce UNIKIE-BENCH, a unified benchmark designed to rigorously evaluate the KIE capabilities of LMMs. UNIKIE-BENCH consists of two complementary tracks: a constrained-category KIE track with scenario-predefined schemas that reflect practical application needs, and an open-category KIE track that extracts any key information that is explicitly present in the document. Experiments on 15 state-of-the-art LMMs reveal substantial performance degradation under diverse schema definitions, long-tail key fields, and complex layouts, along with pronounced performance disparities across different document types and scenarios. These findings underscore persistent challenges in grounding accuracy and layout-aware reasoning for LMM-based KIE. All codes and datasets are available at https://github.com/NEUIR/UNIKIE-BENCH.

Paper Structure (24 sections, 3 equations, 15 figures, 7 tables)

Figures (15)

  • Figure 1: Overview of our UniKIE-Bench. UniKIE-Bench is built upon a schema-guided KIE formulation to enable end-to-end evaluation of LMMs. It comprises two complementary evaluation tracks: the constrained-category KIE track and the open-category KIE track.
  • Figure 2: Faithfulness Analysis of LMMs in KIE. We analyze the relationship between faithfulness and extraction performance of LMMs in KIE.
  • Figure 3: Typical Error Cases of LMMs in KIE. Blue boxes indicate the ground truth, while red boxes denote the model predictions.
  • Figure 4: Document Authenticity Analysis in the Open-Category KIE Track.
  • Figure 5: Diversity Analysis of Document Images in the Open-Category KIE Track.
  • ...and 10 more figures