Towards Real-World Document Parsing via Realistic Scene Synthesis and Document-Aware Training

Gengluo Li; Pengyuan Lyu; Chengquan Zhang; Huawen Shen; Liang Wu; Xingyu Wan; Gangyan Zeng; Han Hu; Can Ma; Yu Zhou

Abstract

Document parsing has recently advanced with multimodal large language models (MLLMs) that directly map document images to structured outputs. Traditional cascaded pipelines depend on precise layout analysis and often fail under casually captured or non-standard conditions. Although end-to-end approaches mitigate this dependency, they still exhibit repetitive, hallucinated, and structurally inconsistent predictions, primarily due to the scarcity of large-scale, high-quality full-page (document-level) end-to-end parsing data and the lack of structure-aware training strategies. To address these challenges, we propose a data-training co-design framework for robust end-to-end document parsing. A Realistic Scene Synthesis strategy constructs large-scale, structurally diverse full-page end-to-end supervision by composing layout templates with rich document elements, while a Document-Aware Training Recipe introduces progressive learning and structure-token optimization to enhance structural fidelity and decoding stability. We further build Wild-OmniDocBench, a benchmark derived from real-world captured documents for robustness evaluation. Integrated into a 1B-parameter MLLM, our method achieves superior accuracy and robustness across both scanned/digital and real-world captured scenarios. All models, data synthesis pipelines, and benchmarks will be publicly released to advance future research in document understanding.

Paper Structure (16 sections, 2 equations, 4 figures, 5 tables)

Introduction
Related Work
Realistic Scene Synthesis
Document-Aware Training Recipe
Wild-OmniDocBench
Experiments
Datasets and Evaluation
Training Data.
Evaluation Benchmarks and Metrics.
Baseline Settings.
Evaluation on Printed Document Parsing
Ablation Study
Data Benefits and Scaling Law
Analyzing Repetitive Decoding in End-to-End Parsing
Limitations and Future Work
...and 1 more sections

Figures (4)

Figure 1: Overall Performance and Degradation from OmniDocBench to Wild-OmniDocBench. Underlined method names correspond to modular cascaded pipelines.
Figure 2: Scanned/Digital and Real-World Capture. On scanned/digital pages, both modular and E2E parsers decode correctly. Under real-world capture, modular cascades accumulate layout-analysis errors that propagate to element parsing (extra/missing regions), while generic end-to-end models exhibit repetitive outputs.
Figure 3: Overview of Realistic Scene Synthesis. Left: repositories of atomic elements and layout templates with reading order. Right: a synthesis pipeline that composes sampled elements into templates under spatial/structural constraints to produce page-level annotations, followed by capture-aware augmentation to simulate real-world images.
Figure 4: Wild-OmniDocBench Construction. We convert scanned pages into real-world–captured images by (i) printing, deforming, and photographing under varied lighting, and (ii) displaying on screens and re-shooting to induce moiré and reflections.
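
The composition step described in the Figure 3 caption can be illustrated with a minimal sketch. It is not the authors' pipeline: the `Slot` and `Element` types, the `compose_page` and `page_target` helpers, and the rule of matching elements to slots by `kind` are all assumptions made for illustration, and the capture-aware augmentation stage is omitted.

```python
import random
from dataclasses import dataclass


@dataclass
class Slot:
    """One region of a hypothetical layout template."""
    kind: str    # element type the slot accepts, e.g. "title", "text", "table"
    box: tuple   # (x0, y0, x1, y1) in page coordinates
    order: int   # reading-order index within the page


@dataclass
class Element:
    """One atomic element from a hypothetical element repository."""
    kind: str
    content: str  # target markup for this element


def compose_page(template, repository, rng):
    """Fill each slot with a sampled element of the same kind.

    Returns page-level annotations sorted by reading order. Slots with no
    compatible element are skipped (a stand-in for structural constraints).
    """
    annotations = []
    for slot in sorted(template, key=lambda s: s.order):
        candidates = [e for e in repository if e.kind == slot.kind]
        if not candidates:
            continue
        element = rng.choice(candidates)
        annotations.append({
            "order": slot.order,
            "kind": slot.kind,
            "box": slot.box,
            "content": element.content,
        })
    return annotations


def page_target(annotations):
    """Join element contents in reading order into one full-page target."""
    return "\n\n".join(a["content"] for a in annotations)


# Usage: a two-slot template listed out of reading order on purpose.
template = [
    Slot("text", (0, 100, 100, 200), order=1),
    Slot("title", (0, 0, 100, 50), order=0),
]
repository = [Element("title", "# Sample Title"), Element("text", "Body paragraph.")]
annotations = compose_page(template, repository, random.Random(0))
target = page_target(annotations)
```

Sorting slots by `order` before sampling keeps the page-level target aligned with the template's reading order regardless of how the slots are stored.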
