DISCO: Document Intelligence Suite for COmparative Evaluation

Kenza Benkirane; Dan Goldwater; Martin Asenov; Aneiss Ghodsi

Abstract

Document intelligence requires accurate text extraction and reliable reasoning over document content. We introduce \textbf{DISCO}, a \emph{Document Intelligence Suite for COmparative Evaluation}, which evaluates optical character recognition (OCR) pipelines and vision-language models (VLMs) separately on parsing and question answering across diverse document types, including handwritten text, multilingual scripts, medical forms, infographics, and multi-page documents. Our evaluation shows that performance varies substantially across tasks and document characteristics, underscoring the need to select an approach according to document complexity. OCR pipelines are generally more reliable for handwriting and for long or multi-page documents, where explicit text grounding supports text-heavy reasoning, while VLMs perform better on multilingual text and visually rich layouts. Task-aware prompting has mixed effects: it improves performance on some document types and degrades it on others. These findings provide empirical guidance for selecting document processing strategies based on document structure and reasoning demands.

Paper Structure (63 sections, 8 figures, 24 tables)

Introduction and Related Work
Datasets and benchmark suite
Methodology and experimental design
Results and discussion
Parsing
Question answering
Conclusion and discussion
Limitations and future work
Full experimental results
Parsing task: in-depth analysis
Handwriting recognition (IAM$_{\text{DISCO}}$)
Multilingual scene text (ICDAR$_{\text{DISCO}}$)
Medical documents (RxPad)
QA task: in-depth analysis
Document questions (DocVQA$_{\text{DISCO}}$)
...and 48 more sections

Figures (8)

Figure 1: Model performance across phases on IAM$_{\text{DISCO}}$.
Figure 2: Model performance across phases on ICDAR$_{\text{DISCO}}$. $P_{\text{OCR}}$: OCR baseline; $P_{\text{VLM-base}}$: VLM with generic prompting; $P_{\text{VLM-task}}$: VLM with task-aware prompting. For CER and WER, lower (green) is better; for ANLS and Cosine Similarity, higher is better.
Figure 3: Model performance across phases on RxPad. $P_{\text{OCR}}$: OCR baseline; $P_{\text{VLM-base}}$: VLM with base prompting; $P_{\text{VLM-task}}$: VLM with task-aware prompting. For $S_{\textrm{CER}}$ and $S_{\textrm{WER}}$, lower (green) is better; for $S_{\textrm{CS}}$, higher is better.
Figure 4: Regression performance when predicting QA correctness from the parsed document data (DocVQA).
Figure 5: DocVQA strategy heatmaps using the primary strategy metric $S_{\textrm{GT-in-Pred}}$ for (1) $QA_{\text{OCR}}$, (2) $QA_{\text{VLM-2stage}}$, and (3) $QA_{\text{VLM-direct}}$.
...and 3 more figures
