MinerU2.5-Pro: Pushing the Limits of Data-Centric Document Parsing at Scale

Bin Wang, Tianyao He, Linke Ouyang, Fan Wu, Zhiyuan Zhao, Tao Chu, Yuan Qu, Zhenjiang Jin, Weijun Zeng, Ziyang Miao, Bangrui Xu, Junbo Niu, Mengzhang Cai, Jiantao Qiu, Qintong Zhang, Dongsheng Ma, Yuefeng Sun, Hejun Dong, Wenzheng Zhang, Jutao Xiao, Jiayong Shi, Pengyu Liao, Xiaomeng Zhao, Huaping Zhong, Liqun Wei, Jing Yu, Jie Yang, Wei Li, Shasha Wang, Qianqian Wu, Xuanhe Zhou, Weijia Li, Zhenxiang Li, Zhongying Tu, Jiang Wu, Lijun Wu, Chao Xu, Kai Chen, Wentao Zhang, Yu Qiao, Bowen Zhou, Dahua Lin, Conghui He

Abstract

Current document parsing methods compete primarily on model architecture innovation, while systematic engineering of training data remains underexplored. Yet SOTA models of different architectures and parameter scales exhibit highly consistent failure patterns on the same set of hard samples, suggesting that the performance bottleneck stems from shared deficiencies in training data rather than architecture itself. Building on this finding, we present \minerupro, which advances the state of the art solely through data engineering and training strategy optimization while keeping the 1.2B-parameter architecture of \mineru completely fixed. At its core is a Data Engine co-designed around coverage, informativeness, and annotation accuracy: Diversity-and-Difficulty-Aware Sampling expands training data from under 10M to 65.5M samples while correcting distribution shift; Cross-Model Consistency Verification leverages output agreement among heterogeneous models to assess sample difficulty and generate reliable annotations; the Judge-and-Refine pipeline improves annotation quality for hard samples through render-then-verify iterative correction. A three-stage progressive training strategy -- large-scale pre-training, hard sample fine-tuning, and GRPO alignment -- sequentially exploits these data at different quality tiers. On the evaluation front, we fix element-matching biases in OmniDocBench~v1.5 and introduce a Hard subset, establishing the more discriminative OmniDocBench~v1.6 protocol. Without any architectural modification, \minerupro achieves 95.69 on OmniDocBench~v1.6, improving over the same-architecture baseline by 2.71 points and surpassing all existing methods including models with over 200$\times$ more parameters.

Paper Structure

This paper contains 82 sections, 1 equation, 18 figures, and 8 tables.

Figures (18)

  • Figure 1: Performance comparison on OmniDocBench v1.6 Base/Hard/Full. Built upon MinerU2.5 with its 1.2B-parameter architecture entirely unchanged, MinerU2.5-Pro improves the overall score from 92.98 to 95.69 solely through data engineering and training strategy optimization, outperforming both specialized document parsing models (e.g. GLM-OCR, PaddleOCR-VL-1.5, Youtu-Parsing) and general-purpose VLMs (e.g. Gemini 3 Pro, Qwen3-VL-235B). Detailed results are presented in Table~\ref{tab:main_results}.
  • Figure 2: Overview of the Data Engine pipeline. The system co-optimizes three dimensions---Coverage, Informativeness, and Accuracy---through four synergistic stages: Diversity-and-Difficulty-Aware Sampling (DDAS), Cross-Model Consistency Verification (CMCV), Judge-and-Refine annotation correction, and targeted expert annotation.
  • Figure 3: The DDAS pipeline operates at two granularity levels. Upper: Page-level sampling for layout detection data---pages from the PDF pool are embedded via ViT-base, clustered, and resampled by jointly weighting cluster diversity and CMCV-derived difficulty, yielding about 60M pages with balanced distribution and difficulty coverage. Lower: Element-level sampling---the selected pages are parsed by layout detection models into text, formula, and table blocks; each element type is independently clustered and assessed by CMCV, then sampled to balance both diversity and difficulty at the element granularity. The two levels are combined to produce the final training data for layout, text, formula, and table subtasks.
  • Figure 4: Examples of element-matching bias in OmniDocBench v1.5. Semantically correct predictions receive low scores due to granularity mismatch between predicted and ground-truth segmentation.
  • Figure 5: Layout Detection examples. The model localizes content regions with bounding boxes, category labels, and rotation flags on diverse document pages.
  • ...and 13 more figures
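The two-level resampling described in the Figure 3 caption (cluster pages, then reweight clusters jointly by diversity and difficulty) can be sketched as follows. This is a minimal illustration, not the paper's implementation: integer cluster ids stand in for ViT-base embedding clusters, the per-page difficulty scores stand in for CMCV-derived difficulty, and the function name, `alpha` mixing weight, and quota scheme are all assumptions.

```python
# Hypothetical sketch of Diversity-and-Difficulty-Aware Sampling (DDAS).
# Cluster ids and difficulty scores stand in for the paper's ViT-base
# embedding clusters and CMCV-derived difficulty; all names are illustrative.
import random
from collections import defaultdict

def ddas_sample(pages, budget, alpha=0.5, seed=0):
    """pages: list of (page_id, cluster_id, difficulty in [0, 1])."""
    rng = random.Random(seed)
    by_cluster = defaultdict(list)
    for pid, cid, diff in pages:
        by_cluster[cid].append((pid, diff))
    n_clusters = len(by_cluster)

    # Diversity term: inverse cluster frequency, so rare layout types are
    # not drowned out by head clusters. Difficulty term: mean difficulty,
    # so clusters that models disagree on get a larger share of the budget.
    weights = {}
    for cid, members in by_cluster.items():
        diversity = (1.0 / len(members)) * len(pages) / n_clusters
        difficulty = sum(d for _, d in members) / len(members)
        weights[cid] = alpha * diversity + (1 - alpha) * difficulty
    total = sum(weights.values())

    sampled = []
    for cid, members in by_cluster.items():
        quota = max(1, round(budget * weights[cid] / total))
        take = min(quota, len(members))
        # Within a cluster, prefer the harder pages.
        hardest_first = sorted(members, key=lambda m: -m[1])
        sampled.extend(pid for pid, _ in hardest_first[:take])
    rng.shuffle(sampled)
    return sampled[:budget]
```

Because each cluster's quota is capped at its size, the sketch can undersample the budget; a production pipeline would redistribute the leftover budget across remaining clusters, and would apply the same scheme a second time at element granularity (text, formula, table blocks), as the caption describes.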