ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images

Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah, Pranav Shetty, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso

TL;DR

ExStrucTiny introduces a versatile benchmark for structured information extraction from document images, unifying KEE, RE, and VQA under schema-driven and on-demand queries. The authors combine manual annotations with large-scale, LLM-assisted synthetic data and a robust quality-validation pipeline to create 304 QA pairs across 110 documents, accompanied by a novel evaluation framework that relies on semantic mapping between ground-truth and predicted outputs. Experimental results show that closed VLMs offer stronger recall and robustness, and that open models benefit from larger model size, though all models struggle with schema adaptation, underspecified queries, and precise answer localization, especially for charts and free text. The dataset and evaluation methodology aim to spur progress toward generalist models capable of flexible, multi-entity structured IE in real-world document understanding tasks. ExStrucTiny thus provides a bedrock for developing schema-aware, visually grounded IE models capable of handling diverse document types and complex extraction requests.

Abstract

Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.

Paper Structure (42 sections, 2 equations, 5 figures, 8 tables)

Figures (5)

  • Figure 1: Extraction performance declines for open-source VLMs as the number of values to extract increases.
  • Figure 2: Three example query-answer pairs from ExStrucTiny. Top: on-demand IE query in plain text. Middle: closed IE query with schema. Bottom: closed IE query in plain text. Answers are JSON-formatted objects following the same structuring conventions.
  • Figure 3: Example entities requested in ExStrucTiny
  • Figure 4: ExStrucTiny Synthetic Data Generation Pipeline
  • Figure 5: Extraction performance breakdown on synthetic QAs: basic vs. fully & partially unanswerable vs. reformulated (averaged across VLMs).