MonkeyOCR v1.5 Technical Report: Unlocking Robust Document Parsing for Complex Patterns

Jiarui Zhang, Yuliang Liu, Zijun Wu, Guosheng Pang, Zhili Ye, Yupei Zhong, Junteng Ma, Tao Wei, Haiyang Xu, Weikai Chen, Zeen Wang, Qiangjun Ji, Fanxi Zhou, Qi Zhang, Yuanrui Hu, Jiahao Liu, Zhang Li, Ziyang Zhang, Qiang Liu, Xiang Bai

TL;DR

MonkeyOCR v1.5 targets robust document parsing for complex layouts with a two-stage vision-language pipeline: stage I predicts layout and reading order, and stage II performs region-level recognition. It introduces visual-consistency reinforcement learning to refine table structures without dense annotations, and two dedicated modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, to handle tables with embedded images and tables that span pages or columns. Experiments on OmniDocBench v1.5, PubTabNet, and OCRFlux-pubtabnet-single demonstrate state-of-the-art accuracy and robustness, particularly for complex tables and heterogeneous document types. The approach offers a scalable, high-fidelity OCR solution with strong practical potential as a foundation model for document understanding.

Abstract

Document parsing is a core task in document intelligence, supporting applications such as information extraction, retrieval-augmented generation, and automated document analysis. However, real-world documents often feature complex layouts with multi-level tables, embedded images or formulas, and cross-page structures, which remain challenging for existing OCR systems. We introduce MonkeyOCR v1.5, a unified vision-language framework that enhances both layout understanding and content recognition through a two-stage pipeline. The first stage employs a large multimodal model to jointly predict layout and reading order, leveraging visual information to ensure sequential consistency. The second stage performs localized recognition of text, formulas, and tables within detected regions, maintaining high visual fidelity while reducing error propagation. To address complex table structures, we propose a visual consistency-based reinforcement learning scheme that evaluates recognition quality via render-and-compare alignment, improving structural accuracy without manual annotations. Additionally, two specialized modules, Image-Decoupled Table Parsing and Type-Guided Table Merging, are introduced to enable reliable parsing of tables containing embedded images and reconstruction of tables crossing pages or columns. Comprehensive experiments on OmniDocBench v1.5 demonstrate that MonkeyOCR v1.5 achieves state-of-the-art performance, outperforming PPOCR-VL and MinerU 2.5 while showing exceptional robustness in visually complex document scenarios. A trial link can be found at https://github.com/Yuliang-Liu/MonkeyOCR .
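
The two-stage flow described in the abstract can be sketched roughly as follows. This is only an illustration of the control flow, not the authors' implementation: the `Region` schema and the functions `detect_layout`, `recognize_region`, and `parse_page` are hypothetical names, and both model calls are replaced by stand-ins that return fixed values.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass


@dataclass
class Region:
    """One layout element predicted by stage I (hypothetical schema)."""
    order: int    # reading-order index
    kind: str     # e.g. "text", "formula", "table"
    bbox: tuple   # (x0, y0, x1, y1) in page pixels


def detect_layout(page):
    """Stage I stand-in: a real system would query a multimodal model here."""
    return [
        Region(order=1, kind="table", bbox=(0, 50, 100, 90)),
        Region(order=0, kind="text", bbox=(0, 0, 100, 40)),
    ]


def recognize_region(page, region):
    """Stage II stand-in: a real system would crop the bbox and recognize it."""
    return f"<{region.kind} order={region.order}>"


def parse_page(page, max_workers=4):
    regions = detect_layout(page)
    # Regions are independent once detected, so recognition can run concurrently.
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        outputs = list(pool.map(lambda r: recognize_region(page, r), regions))
    # Reassemble by the stage I reading order, not by detection order.
    ordered = sorted(zip(regions, outputs), key=lambda pair: pair[0].order)
    return [text for _, text in ordered]


result = parse_page(page=None)
print(result)
```

Because each region is recognized from its own crop, a mistake in one region does not feed into the recognition of another, which is one way to read the abstract's claim about reduced error propagation.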

Paper Structure

This paper contains 15 sections, 6 equations, 11 figures, 4 tables.

Figures (11)

  • Figure 1: Performance comparison of MonkeyOCR v1.5 and other SOTA models.
  • Figure 2: Rapid growth of document parsing methods since June 2025.
  • Figure 3: The overall pipeline of MonkeyOCR v1.5, which first detects all layout elements with their order indices and then recognizes their contents in parallel using a VLM.
  • Figure 4: Visual consistency based GRPO. For each input $x$ containing the original image $I^{\mathcal{O}}$, the policy model generates a response $y$. A renderer produces $I^{\mathcal{R}}$. The triplet $(I^{\mathcal{O}}, y, I^{\mathcal{R}})$ is evaluated by a composite reward that combines a rule-based check with a VLM reward model.
  • Figure 5: Pipeline for tables with embedded images. The pipeline detects embedded images, replaces them with size-accurate placeholders, performs recognition to generate HTML with <img> tags, and finally re-inserts the original images to reconstruct the table.
  • ...and 6 more figures
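
The Figure 5 caption describes a placeholder-and-reinsert flow for tables with embedded images. A minimal sketch of the bookkeeping is given below, under several assumptions that are not taken from the paper: the `ph_N` placeholder identifiers, the convention that the recognizer emits `<img src="ph_N">` tags, the crop file paths, and the function names `make_placeholders` and `reinsert_images` are all hypothetical. Image detection and table recognition themselves are not modeled.

```python
import re


def make_placeholders(image_boxes):
    """Give each detected embedded image a placeholder id and record its size."""
    placeholders = {}
    for idx, (x0, y0, x1, y1) in enumerate(image_boxes):
        placeholders[f"ph_{idx}"] = {
            "bbox": (x0, y0, x1, y1),
            "width": x1 - x0,
            "height": y1 - y0,
        }
    return placeholders


def reinsert_images(table_html, placeholders, crop_paths):
    """Swap placeholder <img> tags for tags that point at the original crops."""
    def swap(match):
        ph_id = match.group(1)
        if ph_id not in placeholders:
            return match.group(0)  # leave unknown tags untouched
        info = placeholders[ph_id]
        return (f'<img src="{crop_paths[ph_id]}" '
                f'width="{info["width"]}" height="{info["height"]}">')
    return re.sub(r'<img src="(ph_\d+)"\s*/?>', swap, table_html)


# Hypothetical example: one embedded image detected at (10, 20, 60, 50).
placeholders = make_placeholders([(10, 20, 60, 50)])
recognized = '<table><tr><td>Logo</td><td><img src="ph_0"></td></tr></table>'
restored = reinsert_images(recognized, placeholders,
                           {"ph_0": "crops/table0_img0.png"})
print(restored)
```

Recording each placeholder's width and height mirrors the caption's "size-accurate" requirement: the table layout seen by the recognizer stays the same as in the original page, and the original image can later be restored at the same dimensions.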