Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing
Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi
TL;DR
This paper tackles robust end-to-end scanned document parsing by introducing LayoutRL, a reinforcement learning framework that explicitly optimizes layout-aware rewards to learn structural representations beyond surface-token matching. It pairs a large-scale Infinity-Doc-400K dataset with Infinity-Parser, a vision-language model-based parser, to enable direct translation from visual input to structured Markdown-like layouts. The core novelty lies in the multi-aspect reward design, combining $R_{\text{dist}}$, $R_{\text{count}}$, and $R_{\text{order}}$ under Group Relative Policy Optimization, which improves both local fidelity and global reading order across diverse English and Chinese benchmarks. Empirical results show state-of-the-art performance on OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet, with enhanced training stability and robust generalization to unseen document types, underscoring the method’s potential for scalable, layout-aware document intelligence. The authors also commit to releasing the dataset and code to accelerate reproducibility and broader adoption.
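To make the role of Group Relative Policy Optimization concrete, the sketch below shows the group-relative advantage step as it is commonly formulated: several candidate parses are sampled for the same page, each is scored, and every score is normalized against the group's mean and standard deviation. The function name `group_relative_advantages`, the use of the population standard deviation, and the `eps` stabilizer are illustrative assumptions, not details taken from the paper.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize the rewards of one sampled group to zero mean and unit scale.

    `rewards` holds one scalar reward per candidate output sampled for the
    same input. `eps` (an assumption of this sketch) avoids division by zero
    when every candidate receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]


# Example: four candidate parses of one page, scored by a layout-aware reward.
advantages = group_relative_advantages([2.1, 2.7, 1.4, 2.7])
```

Candidates that score above the group mean receive a positive advantage and are reinforced; those below it are discouraged, so no separate value network is needed to provide a baseline.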
Abstract
Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
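As a rough illustration of the composite reward described above, the following sketch combines a normalized edit distance term, a paragraph count term, and a reading order term. It is a minimal approximation under stated assumptions: paragraphs are split on blank lines, predicted and reference paragraphs are matched by exact string equality, reading order is scored by the fraction of pairwise inversions, and the three terms are combined with equal weights. All function names are hypothetical, and the paper's actual matching and weighting scheme may differ.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level edit distance via row-by-row dynamic programming."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                # deletion
                           cur[j - 1] + 1,             # insertion
                           prev[j - 1] + (ca != cb)))  # substitution
        prev = cur
    return prev[-1]


def edit_reward(pred: str, ref: str) -> float:
    """1 minus the edit distance normalized by the longer string's length."""
    denom = max(len(pred), len(ref))
    return 1.0 if denom == 0 else 1.0 - levenshtein(pred, ref) / denom


def split_paragraphs(text: str) -> list:
    """Assumption: paragraphs are separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def count_reward(pred_paras: list, ref_paras: list) -> float:
    """Penalize the relative difference in paragraph counts."""
    if not ref_paras:
        return 1.0 if not pred_paras else 0.0
    return max(0.0, 1.0 - abs(len(pred_paras) - len(ref_paras)) / len(ref_paras))


def order_reward(pred_paras: list, ref_paras: list) -> float:
    """1 minus the fraction of matched paragraph pairs that appear out of order."""
    if not ref_paras:
        return 1.0
    # Assumption: exact-match lookup; a fuzzy matcher would be more realistic.
    positions = [pred_paras.index(p) for p in ref_paras if p in pred_paras]
    n = len(positions)
    if n == 0:
        return 0.0
    if n == 1:
        return 1.0
    inversions = sum(1 for i in range(n) for j in range(i + 1, n)
                     if positions[i] > positions[j])
    return 1.0 - inversions / (n * (n - 1) / 2)


def layout_reward(pred: str, ref: str, weights=(1.0, 1.0, 1.0)) -> float:
    """Weighted sum of the three terms; equal weights are an assumption."""
    pred_paras, ref_paras = split_paragraphs(pred), split_paragraphs(ref)
    w_dist, w_count, w_order = weights
    return (w_dist * edit_reward(pred, ref)
            + w_count * count_reward(pred_paras, ref_paras)
            + w_order * order_reward(pred_paras, ref_paras))
```

In this sketch, a prediction that reproduces every paragraph but swaps two of them keeps a full count term while losing on the order term and part of the edit term, which is the kind of structural signal that token-level matching alone would blur.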
