"What is the value of {templates}?" Rethinking Document Information Extraction Datasets for LLMs

Ran Zmigrod, Pranav Shetty, Mathieu Sibue, Zhiqiang Ma, Armineh Nourbakhsh, Xiaomo Liu, Manuela Veloso

TL;DR

This work addresses the need for robust prompt-response benchmarks in visually rich document understanding (VRDU) by transforming five KIE datasets into a diverse, template-rich prompt dataset family called K2Q, comprising over 300,000 questions across more than 12,000 documents. By contrasting simple templates with manually crafted and richly parametrized templates, the authors show that diverse, complex questions improve model robustness and grounding for VRDU tasks. They benchmark seven models (across OCR-based and OCR-free families) under zero-shot and fine-tuned settings, revealing that training on K2Q enhances generalization to unseen question formulations, while still highlighting challenges in grounding and OCR-induced errors. The public release of K2Q aims to elevate data quality for instruction tuning and robust evaluation of multimodal document understanding systems, with future work aimed at expanding templates, few-shot regimes, and multi-turn reasoning.

Abstract

The rise of large language models (LLMs) for visually rich document understanding (VRDU) has kindled a need for prompt-response, document-based datasets. As annotating new datasets from scratch is labor-intensive, the existing literature has generated prompt-response datasets from available resources using simple templates. For the case of key information extraction (KIE), one of the most common VRDU tasks, past work has typically employed the template "What is the value for the {key}?". However, given the variety of questions encountered in the wild, simple and uniform templates are insufficient for creating robust models in research and industrial contexts. In this work, we present K2Q, a diverse collection of five datasets converted from KIE to a prompt-response format using a plethora of bespoke templates. The questions in K2Q can span multiple entities and be extractive or boolean. We empirically compare the performance of seven baseline generative models on K2Q with zero-shot prompting. We further compare three of these models when trained on K2Q versus when trained on simpler templates to motivate the need for our work. We find that creating diverse and intricate KIE questions enhances the performance and robustness of VRDU models. We hope this work encourages future studies on data quality for generative model training.

Paper Structure

This paper contains 46 sections, 3 equations, 16 figures, 14 tables.

Figures (16)

  • Figure 1: Generation pipeline of K2Q datasets. A suite of diverse templates is designed for each specific KIE dataset. These templates are populated in accordance with a configuration file that sets the dataset size and the proportion of extractive and boolean questions.
  • Figure 2: Examples of populated questions and answers from K2Q.
  • Figure 3: Comparison of training and evaluating on complex questions (mK2Q) and simple questions (SD).
  • Figure 4: Detailed breakdown of groundedness and error types for KLC using different training/testing datasets.
  • Figure 5: Excerpt of an Ad-Buy document with generated questions from K2Q, InstructDoc, UReader, and SD. The K2Q question "From when until when is the contract in flight?" uses jargon specific to the advertising domain. Applying such templates allows for creating domain-specific and diverse questions, which may differ from what is colloquially used. The generated question is thus grounded in the jargon used in the document.
  • ...and 11 more figures
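
The Figure 1 caption describes templates being populated according to a configuration that controls dataset size and the split between extractive and boolean questions. The Python sketch below illustrates that idea only; it is not the authors' implementation. The `GenerationConfig` fields, the key names (`advertiser`, `flight_from`, `flight_to`), the boolean template, the distractor mechanism, and the answer formats are all assumptions made for illustration. Only the simple template and the question "From when until when is the contract in flight?" come from the text above.

```python
import random
from dataclasses import dataclass

# Simple baseline template quoted in the abstract.
SIMPLE_TEMPLATE = "What is the value for the {key}?"

# Hypothetical template suite for one KIE dataset. Each entry maps the
# key(s) a question covers to its phrasings, so one question can span
# multiple entities.
EXTRACTIVE_TEMPLATES = {
    ("advertiser",): ["Who is the advertiser?",
                      "Which party placed this ad buy?"],
    ("flight_from", "flight_to"): [
        "From when until when is the contract in flight?"],
}
BOOLEAN_TEMPLATES = {
    "advertiser": ["Is {value} the advertiser?"],
}


@dataclass
class GenerationConfig:
    """Assumed stand-in for the configuration file in Figure 1."""
    num_questions: int    # questions to generate for one document
    boolean_ratio: float  # fraction of questions that are boolean
    seed: int = 0         # fixed seed so generation is reproducible


def simple_question(key):
    """Populate the simple baseline template for a single key."""
    return SIMPLE_TEMPLATE.format(key=key)


def generate_questions(annotations, distractors, config):
    """Populate templates from one document's KIE annotations.

    annotations: dict mapping key -> annotated value.
    distractors: dict mapping key -> list of wrong values, used to
                 build boolean questions whose answer is "No".
    """
    rng = random.Random(config.seed)
    # Keep only templates whose keys are all annotated in this document.
    extractive = [keys for keys in EXTRACTIVE_TEMPLATES
                  if all(k in annotations for k in keys)]
    boolean = [k for k in BOOLEAN_TEMPLATES if k in annotations]
    if not extractive or not boolean:
        raise ValueError("no applicable templates for this document")

    n_boolean = round(config.num_questions * config.boolean_ratio)
    examples = []
    for i in range(config.num_questions):
        if i < n_boolean:
            key = rng.choice(boolean)
            wrong = distractors.get(key, [])
            # Ask about the true value half the time, else a distractor.
            use_true = not wrong or rng.random() < 0.5
            value = annotations[key] if use_true else rng.choice(wrong)
            question = rng.choice(BOOLEAN_TEMPLATES[key]).format(value=value)
            examples.append({"type": "boolean", "question": question,
                             "answer": "Yes" if use_true else "No"})
        else:
            keys = rng.choice(extractive)
            question = rng.choice(EXTRACTIVE_TEMPLATES[keys])
            examples.append({"type": "extractive", "question": question,
                             "answer": [annotations[k] for k in keys]})
    return examples
```

Calling `generate_questions` with a document's annotations and, say, `GenerationConfig(num_questions=4, boolean_ratio=0.5)` yields two boolean and two extractive examples under these assumptions, whereas `simple_question("advertiser")` reproduces the uniform baseline phrasing that the paper argues is insufficient.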