ExStrucTiny: A Benchmark for Schema-Variable Structured Information Extraction from Document Images
Mathieu Sibue, Andres Muñoz Garza, Samuel Mensah, Pranav Shetty, Zhiqiang Ma, Xiaomo Liu, Manuela Veloso
TL;DR
ExStrucTiny introduces a versatile benchmark for structured information extraction from document images, unifying KEE, RE, and VQA under schema-driven and on-demand queries. The authors combine manual annotations with LLM-assisted synthetic data and a human quality-validation pipeline to create 304 QA pairs across 110 documents, accompanied by a novel evaluation framework that relies on semantic mapping between ground-truth and predicted outputs. Experimental results show that closed VLMs offer stronger recall and robustness and that larger model size benefits open models, though all models struggle with schema adaptation, underspecified queries, and precise answer localization, especially for charts and free text. The dataset and evaluation methodology aim to spur progress toward generalist models capable of flexible, multi-entity structured IE across diverse document types and complex extraction requests.
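To make the idea of mapping predicted outputs onto ground-truth outputs more concrete, here is a minimal sketch. It is not the authors' evaluation framework: it assumes flat key-value outputs, swaps in lexical similarity from Python's standard `difflib` module in place of a real semantic matcher, and uses hypothetical names (`similarity`, `map_keys`, `recall`), a hypothetical 0.5 threshold, and made-up example fields.

```python
from difflib import SequenceMatcher


def similarity(a: str, b: str) -> float:
    """Lexical similarity in [0, 1]; an assumed stand-in for a semantic matcher."""
    return SequenceMatcher(None, a.lower(), b.lower()).ratio()


def map_keys(truth: dict, pred: dict, threshold: float = 0.5) -> dict:
    """Greedily map each ground-truth key to its most similar unused predicted key."""
    mapping = {}
    used = set()
    for t_key in truth:
        best_key, best_score = None, threshold
        for p_key in pred:
            if p_key in used:
                continue
            score = similarity(t_key, p_key)
            if score >= best_score:
                best_key, best_score = p_key, score
        if best_key is not None:
            mapping[t_key] = best_key
            used.add(best_key)
    return mapping


def recall(truth: dict, pred: dict) -> float:
    """Fraction of ground-truth fields whose mapped predicted value matches."""
    mapping = map_keys(truth, pred)
    hits = sum(
        1
        for t_key, p_key in mapping.items()
        if str(truth[t_key]).strip().lower() == str(pred[p_key]).strip().lower()
    )
    return hits / len(truth) if truth else 0.0


# Hypothetical example: the model used a different schema than the annotation.
truth = {"invoice_number": "INV-001", "total_amount": "120.50", "vendor": "Acme"}
pred = {"invoice number": "INV-001", "total": "120.50", "customer_id": "42"}

mapping = map_keys(truth, pred)
score = recall(truth, pred)
print(mapping, score)
```

The point of the sketch is that predictions are scored only after their fields are aligned with the ground truth, so a model is not penalized merely for naming a field differently. Nested schemas, multi-entity answers, and localization would need a richer alignment than this greedy key matching.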
Abstract
Enterprise documents, such as forms and reports, embed critical information for downstream applications like data archiving, automated workflows, and analytics. Although generalist Vision Language Models (VLMs) perform well on established document understanding benchmarks, their ability to conduct holistic, fine-grained structured extraction across diverse document types and flexible schemas is not well studied. Existing Key Entity Extraction (KEE), Relation Extraction (RE), and Visual Question Answering (VQA) datasets are limited by narrow entity ontologies, simple queries, or homogeneous document types, often overlooking the need for adaptable and structured extraction. To address these gaps, we introduce ExStrucTiny, a new benchmark dataset for structured Information Extraction (IE) from document images, unifying aspects of KEE, RE, and VQA. Built through a novel pipeline combining manual and synthetic human-validated samples, ExStrucTiny covers more varied document types and extraction scenarios. We analyze open and closed VLMs on this benchmark, highlighting challenges such as schema adaptation, query under-specification, and answer localization. We hope our work provides a bedrock for improving generalist models for structured IE in documents.
