MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding

Ketong Chen, Yuhao Chen, Yang Xue

TL;DR

MosaicDoc addresses the lack of realistic, multilingual benchmarks for Visually Rich Document Understanding (VRDU). The authors introduce DocWeaver, a fully automated, LLM-powered multi-agent pipeline that generates high-fidelity, multi-task annotations for complex newspaper and magazine documents. The resulting benchmark, MosaicDoc, contains 72K images and over 600K QA pairs in English and Chinese, covering OCR, DocVQA, reading order, and localization. An extensive evaluation of 13 SOTA models exposes their limitations, particularly on dense layouts and multi-span reasoning, and establishes a new, more challenging baseline. The work demonstrates the practicality of automated data creation for VRDU and outlines directions for extending the pipeline to historical and handwritten documents, which would broaden the robustness and scope of document intelligence research.

Abstract

Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.

Paper Structure

This paper contains 38 sections, 8 equations, 19 figures, 11 tables, 1 algorithm.

Figures (19)

  • Figure 1: Examples of VRDU Tasks in the MosaicDoc Benchmark.
  • Figure 2: Overview of DocWeaver, a Multi-Agent Pipeline for Generating the MosaicDoc Benchmark.
  • Figure 3: The left panels show the distributions of question similarity and token length; the right panel compares the number of multi-span and single-span question-answer pairs.
  • Figure 4: Source category distribution of MosaicDoc dataset.
  • Figure 5: The BLEU scores are calculated for the left-to-right and top-to-bottom order to measure the layout complexity of MosaicDoc.
  • ...and 14 more figures
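
The Figure 5 caption refers to a left-to-right, top-to-bottom reading order as a reference point for layout complexity. The summary above does not give the authors' exact procedure, so the following is only a minimal sketch of what such a naive ordering could look like. The `Block` tuple, the `row_tolerance` parameter, and the function name are all hypothetical choices made for illustration, not part of MosaicDoc or DocWeaver.

```python
from typing import List, Tuple

# Hypothetical block representation: (text, x_left, y_top) in pixels.
Block = Tuple[str, float, float]


def naive_reading_order(blocks: List[Block], row_tolerance: float = 10.0) -> List[str]:
    """Order blocks top-to-bottom, then left-to-right within each row.

    Assumption: two blocks belong to the same row when their top edges
    differ by at most `row_tolerance` pixels from the row's first block.
    """
    # Sort by top edge so rows can be grouped in a single pass.
    by_top = sorted(blocks, key=lambda b: b[2])

    rows: List[List[Block]] = []
    for block in by_top:
        # Extend the current row if this block is close to the row's first block,
        # otherwise start a new row.
        if rows and abs(block[2] - rows[-1][0][2]) <= row_tolerance:
            rows[-1].append(block)
        else:
            rows.append([block])

    ordered: List[str] = []
    for row in rows:
        # Within a row, read from left to right.
        ordered.extend(text for text, _, _ in sorted(row, key=lambda b: b[1]))
    return ordered


# A two-column page whose intended reading order is A1, A2, B1, B2.
two_column_page: List[Block] = [
    ("A1", 0.0, 0.0),
    ("A2", 0.0, 100.0),
    ("B1", 300.0, 0.0),
    ("B2", 300.0, 100.0),
]
naive_order = naive_reading_order(two_column_page)
```

On this two-column example the naive order interleaves the columns (A1, B1, A2, B2) instead of following each column in turn. That kind of mismatch is what a text-similarity score such as BLEU between the naive sequence and the annotated sequence would penalize, so a low score would suggest a more complex layout.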