BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations

Simone Giovannini, Fabio Coppini, Andrea Gemelli, Simone Marinai

TL;DR

BoundingDocs tackles the shortage of spatially grounded document QA data by unifying multiple public datasets into a QA format and attaching exact answer bounding boxes. The authors release OCR via Textract for all documents and explore prompting strategies, including layout-aware inputs, to evaluate open-weight models. They show that question rephrasing and incorporating bounding-box information can boost performance, albeit sometimes increasing output parsing complexity, which can be mitigated with regex post-processing. The work demonstrates the value of a standardized, spatially annotated dataset for training and evaluating document-aware LLMs and provides actionable insights for prompting and fine-tuning in document understanding tasks.

Abstract

We present a unified dataset for document Question-Answering (QA), obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold: on the one hand, we reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task, making the dataset a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of the answer to be found in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (that might include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
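
To make the idea of attaching an answer bounding box concrete, here is a minimal sketch of how an answer string might be located among OCR words and given a single enclosing box. It is an illustration under stated assumptions, not the authors' matching procedure, which is not described in this section: the `OcrWord` type, the `locate_answer` function, the `(left, top, right, bottom)` box layout, and the punctuation-insensitive exact matching are all hypothetical choices.

```python
from dataclasses import dataclass
from typing import List, Optional, Tuple

Box = Tuple[float, float, float, float]  # (left, top, right, bottom); layout is an assumption


@dataclass
class OcrWord:
    """One OCR token with its box (hypothetical structure, not the dataset schema)."""
    text: str
    box: Box


def _norm(token: str) -> str:
    # Lowercase and drop leading/trailing punctuation so "Amount:" matches "amount".
    return token.strip(".,:;").lower()


def locate_answer(words: List[OcrWord], answer: str) -> Optional[Box]:
    """Return the union box of the first run of OCR words matching the answer."""
    target = [_norm(t) for t in answer.split()]
    if not target:
        return None
    n = len(target)
    # Slide a window of n consecutive words over the page, in reading order.
    for start in range(len(words) - n + 1):
        window = words[start:start + n]
        if [_norm(w.text) for w in window] == target:
            lefts, tops, rights, bottoms = zip(*(w.box for w in window))
            # Union of the matched word boxes.
            return (min(lefts), min(tops), max(rights), max(bottoms))
    return None  # answer not found verbatim in the OCR output


page = [
    OcrWord("Total", (10, 10, 50, 20)),
    OcrWord("Amount:", (55, 10, 110, 20)),
    OcrWord("$1,250.00", (115, 10, 180, 21)),
]
print(locate_answer(page, "$1,250.00"))
print(locate_answer(page, "total amount"))
```

Exact token matching is the simplest choice and fails when the OCR output differs from the annotated answer (for example, split tokens or misread characters); a fuzzy comparison would be a natural substitute in that case.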

Paper Structure (28 sections, 18 figures, 16 tables)

Figures (18)

  • Figure 1: Dataset construction pipeline. The process begins with two main dataset categories: QA datasets (red) and Key-Value extraction datasets (blue). Both categories are processed using AWS OCR (Textract), followed by annotation matching. For the Key-Value extraction datasets, questions are generated and rephrased using Mistral 7B v0.3. All processed components are then unified into BoundingDocs (purple).
  • Figure 2: Sample of QA pairs from the dataset. The left QA pair is sourced from Deepform, while the right one is from Kleister Charity. The purple values represent the specific details related to each QA pair, and the blue keys denote the fixed structure defined for our dataset.
  • Figure 3: Deepform page with bbox annotations.
  • Figure 4: VRDU Registration Form page with bbox annotations.
  • Figure 5: Experimental framework showing the different prompting strategies implemented for vision and text language models. Vision LLMs (blue) are prompted with images and rephrased questions, while text LLMs (red) are prompted with three different configurations of page content and questions. Some models underwent supervised fine-tuning before testing, while others were evaluated directly.
  • ...and 13 more figures