MosaicDoc: A Large-Scale Bilingual Benchmark for Visually Rich Document Understanding
Ketong Chen, Yuhao Chen, Yang Xue
TL;DR
MosaicDoc addresses the lack of realistic, multilingual benchmarks for Visually Rich Document Understanding (VRDU). The authors introduce DocWeaver, a fully automated, LLM-powered multi-agent pipeline that generates high-fidelity, multi-task annotations for complex newspaper and magazine documents. The resulting benchmark, MosaicDoc, contains 72K images and over 600K QA pairs in English and Chinese, covering OCR, DocVQA, reading order, and localization. An extensive evaluation of 13 state-of-the-art models exposes their limitations, particularly on dense layouts and multi-span reasoning, and establishes a new, more challenging baseline. The work demonstrates the practicality of automated data creation for VRDU and outlines directions for extending the pipeline to historical and handwritten documents, which would broaden the robustness and scope of document intelligence research.
Abstract
Despite the rapid progress of Vision-Language Models (VLMs), their capabilities are inadequately assessed by existing benchmarks, which are predominantly English-centric, feature simplistic layouts, and support limited tasks. Consequently, they fail to evaluate model performance for Visually Rich Document Understanding (VRDU), a critical challenge involving complex layouts and dense text. To address this, we introduce DocWeaver, a novel multi-agent pipeline that leverages Large Language Models to automatically generate a new benchmark. The result is MosaicDoc, a large-scale, bilingual (Chinese and English) resource designed to push the boundaries of VRDU. Sourced from newspapers and magazines, MosaicDoc features diverse and complex layouts (including multi-column and non-Manhattan), rich stylistic variety from 196 publishers, and comprehensive multi-task annotations (OCR, VQA, reading order, and localization). With 72K images and over 600K QA pairs, MosaicDoc serves as a definitive benchmark for the field. Our extensive evaluation of state-of-the-art models on this benchmark reveals their current limitations in handling real-world document complexity and charts a clear path for future research.
