Dolphin-v2: Universal Document Parsing via Scalable Anchor Prompting

Hao Feng, Wei Shi, Ke Zhang, Xiang Fei, Lei Liao, Dingkang Yang, Yongkun Du, Xuecheng Wu, Jingqun Tang, Yang Liu, Hong Chen, Can Huang

TL;DR

Dolphin-v2 tackles model fragmentation and distortion sensitivity in document parsing with a document-type-aware two-stage framework. Stage 1 jointly classifies each page as digital-born or photographed and produces layout anchors. Stage 2 applies a hybrid strategy: photographed pages are parsed holistically, while digital-born pages are parsed element by element in parallel, guided by the anchors. The model covers 21 fine-grained element types and includes dedicated modules for formulas and code blocks. It gains 14.78 points overall on OmniDocBench and substantially reduces errors on RealDoc-160, while parallel processing keeps inference efficient.

Abstract

Document parsing has garnered widespread attention as vision-language models (VLMs) advance OCR capabilities. However, the field remains fragmented across dozens of specialized models with varying strengths, forcing users to navigate complex model selection and limiting system scalability. Moreover, existing two-stage approaches depend on axis-aligned bounding boxes for layout detection, failing to handle distorted or photographed documents effectively. To this end, we present Dolphin-v2, a two-stage document image parsing model that substantially improves upon the original Dolphin. In the first stage, Dolphin-v2 jointly performs document type classification (digital-born versus photographed) alongside layout analysis. For digital-born documents, it conducts finer-grained element detection with reading order prediction. In the second stage, we employ a hybrid parsing strategy: photographed documents are parsed holistically as complete pages to handle geometric distortions, while digital-born documents undergo element-wise parallel parsing guided by the detected layout anchors, enabling efficient content extraction. Compared with the original Dolphin, Dolphin-v2 introduces several crucial enhancements: (1) robust parsing of photographed documents via holistic page-level understanding, (2) finer-grained element detection (21 categories) with semantic attribute extraction such as author information and document metadata, and (3) code block recognition with indentation preservation, which existing systems typically lack. Comprehensive evaluations are conducted on DocPTBench, OmniDocBench, and our self-constructed RealDoc-160 benchmark. The results demonstrate substantial improvements: +14.78 points overall on the challenging OmniDocBench and 91% error reduction on photographed documents, while maintaining efficient inference through parallel processing.
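
The control flow described in the abstract can be sketched as follows. This is a minimal illustration, not the authors' implementation: the names `Anchor`, `Stage1Result`, `parse_page`, and the three callables passed in are hypothetical stand-ins for the model calls, and a thread pool is assumed here only to illustrate the parallel element-wise step.

```python
from concurrent.futures import ThreadPoolExecutor
from dataclasses import dataclass
from typing import Any, Callable, List, Tuple


@dataclass
class Anchor:
    """One layout element from Stage 1 (hypothetical structure)."""
    order: int                      # position in the predicted reading order
    kind: str                       # element type, e.g. "paragraph", "table", "code"
    box: Tuple[int, int, int, int]  # (x0, y0, x1, y1) region on the page


@dataclass
class Stage1Result:
    doc_type: str          # "digital" or "photographed"
    anchors: List[Anchor]  # layout anchors; unused for photographed pages


def parse_page(
    page: Any,
    stage1: Callable[[Any], Stage1Result],
    parse_whole_page: Callable[[Any], str],
    parse_element: Callable[[Any, Anchor], str],
    max_workers: int = 4,
) -> str:
    """Route a page through the hybrid strategy described in the abstract."""
    result = stage1(page)

    # Photographed pages: parse the full page in one pass, so geometric
    # distortions never have to be cut along axis-aligned boxes.
    if result.doc_type == "photographed":
        return parse_whole_page(page)

    # Digital-born pages: parse each anchored element independently,
    # then join the pieces in reading order.
    anchors = sorted(result.anchors, key=lambda a: a.order)
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # Executor.map yields results in input order, so reading order is kept.
        parts = list(pool.map(lambda a: parse_element(page, a), anchors))
    return "\n\n".join(parts)
```

Passing the model calls in as plain callables keeps the sketch independent of any particular inference API; in a real system the parallel step would more likely be a batched forward pass than a thread pool.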

Paper Structure (16 sections, 10 figures, 8 tables)

This paper contains 16 sections, 10 figures, 8 tables.

Figures (10)

  • Figure 1: Performance comparison between Dolphin (Feng et al., 2025) and Dolphin-v2 across diverse document scenarios on OmniDocBench (Ouyang et al., 2024). All metrics are normalized to a 0-100 scale where higher is better.
  • Figure 2: Timeline illustrating the development of multi-stage and one-stage vision-language models for document image parsing.
  • Figure 3: Overview of the two-stage document image parsing paradigm in Dolphin-v2. It consists of Stage 1 for page-level document type classification (photographed vs. digital) and layout analysis that generates structured layout sequences in reading order, as well as Stage 2 for hybrid content parsing, where photographed documents are parsed holistically while digital documents undergo element-wise parallel parsing.
  • Figure 4: Examples of input-output pairs of Dolphin-v2, including page-level layout analysis and element-level content parsing for text paragraphs, tables, formulas, and code blocks. "$P_*$" denotes the different prompts. Each element type is parsed into its corresponding format (e.g., HTML for tables, LaTeX for formulas, indented text for code).
  • Figure 5: Comparison of the parsing results on a photographed document with distortions, perspective transformations, and blur. Dolphin-v2 accurately parses the content, while the advanced MinerU2.5 (Niu et al., 2025) fails to handle such challenging conditions.
  • ...and 5 more figures
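
The Figure 4 caption mentions type-specific prompts ("$P_*$") and per-type output formats. One way to picture that is a lookup from element type to prompt, sketched below. The prompt wording, the `PROMPTS` table, and `prompt_for` are all hypothetical; the page gives only the target formats (HTML for tables, LaTeX for formulas, indented text for code), not the actual prompt strings.

```python
from typing import Dict

# Hypothetical prompt table: only the target formats come from the caption,
# the prompt wording itself is illustrative.
PROMPTS: Dict[str, str] = {
    "table": "Parse the table in the image as HTML.",
    "formula": "Parse the formula in the image as LaTeX.",
    "code": "Read the code in the image and preserve its indentation.",
}

# Fallback for element types without a dedicated format, e.g. text paragraphs.
DEFAULT_PROMPT = "Read the text in the image."


def prompt_for(kind: str) -> str:
    """Return the prompt for an element type, falling back to plain text."""
    return PROMPTS.get(kind, DEFAULT_PROMPT)
```

A flat table like this makes it cheap to add element types: a new category needs one new entry rather than a new parsing model.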