BoundingDocs: a Unified Dataset for Document Question Answering with Spatial Annotations
Simone Giovannini, Fabio Coppini, Andrea Gemelli, Simone Marinai
TL;DR
BoundingDocs tackles the shortage of spatially grounded document QA data by unifying multiple public datasets into a QA format and attaching exact answer bounding boxes. The authors release OCR via Textract for all documents and explore prompting strategies, including layout-aware inputs, to evaluate open-weight models. They show that question rephrasing and incorporating bounding-box information can boost performance, albeit sometimes increasing output parsing complexity, which can be mitigated with regex post-processing. The work demonstrates the value of a standardized, spatially annotated dataset for training and evaluating document-aware LLMs and provides actionable insights for prompting and fine-tuning in document understanding tasks.
Abstract
We present a unified dataset for document Question-Answering (QA), obtained by combining several public datasets related to Document AI and visually rich document understanding (VRDU). Our main contribution is twofold: on the one hand, we reformulate existing Document AI tasks, such as Information Extraction (IE), into a Question-Answering task, making the dataset a suitable resource for training and evaluating Large Language Models; on the other hand, we release the OCR of all the documents and include the exact position of the answer in the document image as a bounding box. Using this dataset, we explore the impact of different prompting techniques (some of which include bounding box information) on the performance of open-weight models, identifying the most effective approaches for document comprehension.
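To make the two ideas in the abstract concrete, the following minimal Python sketch shows how an IE field could be recast as a QA record with an answer bounding box, and how OCR boxes could be serialized into a layout-aware prompt. It is an illustration under stated assumptions, not the authors' pipeline: the names `OcrWord`, `field_to_qa`, and `layout_aware_prompt`, the question template, the `(x0, y0, x1, y1)` pixel convention, and the prompt layout are all hypothetical, and the dataset's actual schema may differ.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional, Tuple

Box = Tuple[int, int, int, int]  # assumed (x0, y0, x1, y1) in pixels


@dataclass
class OcrWord:
    """One OCR token with its box (hypothetical structure)."""
    text: str
    bbox: Box


def union_bbox(boxes: List[Box]) -> Box:
    """Smallest box enclosing all given boxes."""
    return (
        min(b[0] for b in boxes),
        min(b[1] for b in boxes),
        max(b[2] for b in boxes),
        max(b[3] for b in boxes),
    )


def field_to_qa(field_name: str, value: str, words: List[OcrWord]) -> Optional[Dict]:
    """Turn an IE (field, value) pair into a QA record with an answer box.

    The value is located as an exact, contiguous run of OCR tokens;
    returns None when it cannot be found.
    """
    tokens = value.split()
    if not tokens:
        return None
    texts = [w.text for w in words]
    for i in range(len(texts) - len(tokens) + 1):
        if texts[i:i + len(tokens)] == tokens:
            span = words[i:i + len(tokens)]
            return {
                "question": f"What is the {field_name}?",  # assumed template
                "answer": value,
                "bbox": union_bbox([w.bbox for w in span]),
            }
    return None


def layout_aware_prompt(question: str, words: List[OcrWord]) -> str:
    """Serialize each OCR token with its box, then append the question."""
    lines = [
        f"{w.text} [{w.bbox[0]},{w.bbox[1]},{w.bbox[2]},{w.bbox[3]}]"
        for w in words
    ]
    return "Document:\n" + "\n".join(lines) + f"\nQuestion: {question}\nAnswer:"


words = [
    OcrWord("Invoice", (10, 10, 60, 20)),
    OcrWord("Total:", (10, 30, 50, 40)),
    OcrWord("42.00", (55, 30, 90, 40)),
    OcrWord("EUR", (95, 30, 120, 40)),
]
record = field_to_qa("total amount", "42.00 EUR", words)
prompt = layout_aware_prompt(record["question"], words)
```

Exact token matching is the simplest possible alignment; OCR noise in real documents would likely require fuzzy matching, which this sketch leaves out.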
