Infinity Parser: Layout Aware Reinforcement Learning for Scanned Document Parsing

Baode Wang, Biao Wu, Weizhen Li, Meng Fang, Zuming Huang, Jun Huang, Haozhe Wang, Yanjie Liang, Ling Chen, Wei Chu, Yuan Qi

TL;DR

This paper tackles robust end-to-end scanned document parsing by introducing LayoutRL, a reinforcement learning framework that explicitly optimizes layout-aware rewards to learn structural representations beyond surface-token matching. It pairs a large-scale Infinity-Doc-400K dataset with Infinity-Parser, a vision-language model-based parser, to enable direct translation from visual input to structured Markdown-like layouts. The core novelty lies in the multi-aspect reward design, combining $R_{\text{dist}}$, $R_{\text{count}}$, and $R_{\text{order}}$ under Group Relative Policy Optimization, which improves both local fidelity and global reading order across diverse English and Chinese benchmarks. Empirical results show state-of-the-art performance on OmniDocBench, olmOCR-Bench, PubTabNet, and FinTabNet, with enhanced training stability and robust generalization to unseen document types, underscoring the method’s potential for scalable, layout-aware document intelligence. The authors also commit to releasing the dataset and code to accelerate reproducibility and broader adoption.
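
To make the "group relative" part of Group Relative Policy Optimization concrete, the sketch below normalizes the rewards of several sampled outputs for one document against their own group mean and standard deviation. This is the commonly described form of the advantage, not code taken from the paper; the function name `group_relative_advantages`, the `eps` term, and the use of the population standard deviation are assumptions made for illustration.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-8):
    """Normalize each reward against the mean and spread of its own group.

    `rewards` holds one scalar reward per sampled output for the same input.
    `eps` (an illustrative choice) avoids division by zero when all rewards tie.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; an assumption of this sketch
    return [(r - mu) / (sigma + eps) for r in rewards]


# Three hypothetical rollouts for one page, scored by some reward function.
advantages = group_relative_advantages([1.0, 2.0, 3.0])
```

Outputs that score above the group average get a positive advantage and those below get a negative one, so no separate value network is needed to provide a baseline.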

Abstract

Automated parsing of scanned documents into richly structured, machine-readable formats remains a critical bottleneck in Document AI, as traditional multi-stage pipelines suffer from error propagation and limited adaptability to diverse layouts. We introduce layoutRL, an end-to-end reinforcement learning framework that trains models to be explicitly layout-aware by optimizing a composite reward of normalized edit distance, paragraph count accuracy, and reading order preservation. Leveraging our newly released dataset, Infinity-Doc-55K, which combines 55K high-fidelity synthetic scanned document parsing data with expert-filtered real-world documents, we instantiate layoutRL in a vision-language-model-based parser called Infinity-Parser. Evaluated on English and Chinese benchmarks for OCR, table and formula extraction, and reading order detection, Infinity-Parser achieves new state-of-the-art performance in both accuracy and structural fidelity, outpacing specialist pipelines and general-purpose vision-language models. We will publicly release our code and dataset to accelerate progress in robust document understanding.
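
The composite reward named in the abstract (normalized edit distance, paragraph count accuracy, reading order preservation) can be sketched roughly as follows. This is a minimal illustration and not the authors' implementation: the blank-line paragraph splitting, the greedy matching with a 0.5 similarity threshold, the pairwise-inversion order score, and the unweighted sum are all assumptions, as are the function names.

```python
from itertools import combinations


def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance between two strings."""
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def similarity(a: str, b: str) -> float:
    """1 - normalized edit distance, in [0, 1]."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def split_paragraphs(text: str) -> list:
    """Assumption: paragraphs are separated by blank lines."""
    return [p.strip() for p in text.split("\n\n") if p.strip()]


def match_paragraphs(pred, ref, threshold=0.5):
    """Greedy one-to-one matching; returns (ref_idx, pred_idx) pairs in ref order."""
    pairs, used = [], set()
    for i, r in enumerate(ref):
        best_j, best_s = None, threshold
        for j, p in enumerate(pred):
            if j in used:
                continue
            s = similarity(p, r)
            if s > best_s:
                best_j, best_s = j, s
        if best_j is not None:
            used.add(best_j)
            pairs.append((i, best_j))
    return pairs


def reward_dist(pred_text: str, ref_text: str) -> float:
    """Content fidelity: 1 - normalized edit distance over the full output."""
    return similarity(pred_text, ref_text)


def reward_count(pred, ref) -> float:
    """Penalize a mismatch between predicted and reference paragraph counts."""
    if not ref:
        return 1.0 if not pred else 0.0
    return max(0.0, 1.0 - abs(len(pred) - len(ref)) / len(ref))


def reward_order(pairs) -> float:
    """Fraction of matched paragraph pairs that keep their relative order."""
    if len(pairs) < 2:
        return 1.0  # order is vacuously preserved in this sketch
    pred_idx = [j for _, j in pairs]
    total = len(pred_idx) * (len(pred_idx) - 1) // 2
    inversions = sum(1 for x, y in combinations(pred_idx, 2) if x > y)
    return 1.0 - inversions / total


def layout_reward(pred_text: str, ref_text: str) -> float:
    """Unweighted sum of the three terms (the weighting is an assumption)."""
    pred, ref = split_paragraphs(pred_text), split_paragraphs(ref_text)
    pairs = match_paragraphs(pred, ref)
    return reward_dist(pred_text, ref_text) + reward_count(pred, ref) + reward_order(pairs)


reference = "Title\n\nFirst paragraph.\n\nSecond paragraph."
swapped = "Title\n\nSecond paragraph.\n\nFirst paragraph."
perfect_score = layout_reward(reference, reference)
swapped_score = layout_reward(swapped, reference)
```

In this example the swapped output keeps the right paragraph count but loses score on both the edit-distance and order terms, which is the kind of structural error a token-level objective alone would not single out.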

Paper Structure

This paper contains 35 sections, 4 equations, 15 figures, 10 tables.

Figures (15)

  • Figure 1: Comparison of document parsing performance on OmniDocBench under different training strategies as training data size increases. Left: Evaluation with two complementary metrics: (1) Paragraph-level accuracy (edit distance evaluation on element contents only), which assesses element-wise consistency within individual element contents, independent of inter-element reading order; and (2) Page-level accuracy (edit distance evaluation on element contents and reading order), which measures global document reconstruction quality by aligning predicted outputs (e.g., texts, tables, and formulas) with ground-truth sequences. Right: In-Distribution and Out-of-Distribution task performance measured by accuracy score (1 − NED). See detailed descriptions of the task in Section \ref{subsec:further_analysis}.
  • Figure 2: Data construction pipelines for document parsing. (a) Real-world pipelines enhance quality by combining multiple expert models and layout analysis, yielding better-aligned supervision through intersection and reading order reasoning. (b) Synthetic pipeline leverages structured HTML templates and browser rendering to generate clean, exactly-aligned scanned document parsing data, ensuring high-quality supervision for end-to-end parsing.
  • Figure 3: Overview of Infinity-Parser training framework. Our model is optimized via reinforcement finetuning with edit distance, layout, and order-based rewards.
  • Figure 4: Performance comparison of SFT and Layout-Aware RL on OmniDocBench sub-tasks.
  • Figure 5: Comparison of model performance on different document parsing tasks.
  • ...and 10 more figures