Document Understanding Dataset and Evaluation (DUDE)

Jordy Van Landeghem, Rubén Tito, Łukasz Borchmann, Michał Pietruszka, Paweł Józiak, Rafał Powalski, Dawid Jurkiewicz, Mickaël Coustaty, Bertrand Ackaert, Ernest Valveny, Matthew Blaschko, Sien Moens, Tomasz Stanisławek

TL;DR

DUDE introduces a large-scale, multi-page, multi-domain DocVQA benchmark to push practical document understanding beyond single-domain or single-page tasks. By combining diverse documents, complex multi-hop and layout-navigating questions, and varied answer types, it exposes gaps in current DocVQA methods, especially in integrating language, vision, and layout information for long documents. The paper presents an evaluation framework built on ANLS, calibration (ECE), and selective-risk (AURC) metrics, along with extensive baselines spanning encoder-only, encoder-decoder, and multi-modal models, including LLMs. Key findings show that state-of-the-art models lag behind human performance, that long-context and 2D layout representations help, and that real-world document understanding tasks need robust visual and layout-aware modeling. DUDE thus provides a practical, extensible platform to drive future architectural innovations and better calibration for DocAI systems.
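To make the ANLS metric named above concrete, here is a minimal sketch of the standard Average Normalized Levenshtein Similarity as it is commonly defined for DocVQA. It is an illustration under stated assumptions, not DUDE's official scorer: the function names, the lowercase-and-strip normalization, and the 0.5 threshold are assumptions, and any special handling the benchmark applies to not-answerable or list-type answers is not reproduced here.

```python
from typing import List, Sequence


def levenshtein(a: str, b: str) -> int:
    """Edit distance between two strings (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def anls_score(prediction: str, references: Sequence[str],
               threshold: float = 0.5) -> float:
    """Best similarity of one prediction against its reference answers.

    Assumed normalization: strip whitespace and lowercase both strings.
    A normalized distance at or above `threshold` scores 0.
    """
    pred = prediction.strip().lower()
    best = 0.0
    for reference in references:
        gold = reference.strip().lower()
        longest = max(len(pred), len(gold))
        distance = 0.0 if longest == 0 else levenshtein(pred, gold) / longest
        score = 1.0 - distance if distance < threshold else 0.0
        best = max(best, score)
    return best


def anls(predictions: Sequence[str],
         references: Sequence[Sequence[str]],
         threshold: float = 0.5) -> float:
    """Mean per-question score over a dataset."""
    scores: List[float] = [anls_score(p, r, threshold)
                           for p, r in zip(predictions, references)]
    return sum(scores) / len(scores) if scores else 0.0
```

The threshold is what separates ANLS from plain edit similarity: a near miss such as an OCR slip still earns partial credit, while an answer that differs in half or more of its characters is treated as wrong.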

Abstract

We call on the Document AI (DocAI) community to reevaluate current methodologies and embrace the challenge of creating more practically oriented benchmarks. Document Understanding Dataset and Evaluation (DUDE) seeks to remediate the halted research progress in understanding visually-rich documents (VRDs). We present a new dataset with novelties related to types of questions, answers, and document layouts based on multi-industry, multi-domain, and multi-page VRDs of various origins and dates. Moreover, we are pushing the boundaries of current methods by creating multi-task and multi-domain evaluation setups that more accurately simulate real-world situations where powerful generalization and adaptation under low-resource settings are desired. DUDE aims to set a new standard as a more practical, long-standing benchmark for the community, and we hope that it will lead to future extensions and contributions that address real-world challenges. Finally, our work illustrates the importance of finding more efficient ways to model language, images, and layout in DocAI.

Paper Structure (51 sections, 5 equations, 17 figures, 5 tables)

Figures (17)

  • Figure 1: Visualization of inter-document similarities between samples from different datasets (t-SNE over TF-IDF representations of 1k passages from each source).
  • Figure 2: Distribution of the number of tokens in documents, answers, and questions.
  • Figure 3: While other datasets are predominantly single-page only, the number of pages featuring in DUDE is more diverse, yet still biased towards shorter documents.
  • Figure 4: Count of particular diagnostic categories in a subset of 2.5k test set QA pairs annotated in detail to help analyze models' performance.
  • Figure 5: We report the average ANLS for the human expert vs. the best-performing model per diagnostic category as a ceiling analysis.
  • ...and 12 more figures