DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation

Yongkun Du, Pinxuan Chen, Xuye Ying, Zhineng Chen

TL;DR

DocPTBench introduces the first end-to-end benchmark for photographed documents, exposing robustness gaps in both expert parsing systems and general multimodal LLMs when transitioning from pristine to real-world captured content. By integrating over 1,300 photographed documents across multiple domains, eight translation directions, and human-verified parsing and translation annotations, the benchmark enables fair comparisons between specialized parsers and general MLLMs. The study reveals substantial performance drops due to photographic distortions, demonstrates that unwarping helps but does not fully close the gap, and shows that decoupling perception from translation via Chain-of-Thought prompting improves end-to-end translation yet is not a universal solution. Overall, DocPTBench provides a realistic, open resource to push toward robust, real-world document intelligence systems that function under uncontrolled capture conditions.

Abstract

The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, includes eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance decline: popular MLLMs exhibit an average accuracy drop of 18% in end-to-end parsing and 12% in translation, while specialized document parsing models show a significant average decrease of 25%. This substantial performance gap underscores the unique challenges posed by documents captured in real-world conditions and reveals the limited robustness of existing models. Dataset and code are available at https://github.com/Topdu/DocPTBench.

Paper Structure

This paper contains 24 sections, 26 figures, and 8 tables.

Figures (26)

  • Figure 1: (a): the results of MLLMs on English (En)-started parsing (P) and translation (T) tasks; (b): the counterpart on Chinese (Zh)-started tasks; (c): the results from document parsing expert models. Ori- refers to the original digital-born document and Photographed- is its photographed version. Text- indicates that only the textual content of the document image is used as the source-language input. A lower edit distance indicates higher parsing quality, and a higher BLEU score reflects better translation fidelity.
  • Figure 2: Overview of the DocPTBench benchmark construction.
  • Figure 3: Document parsing results of PaddleOCR-VL across different document conditions. The images visually corroborate our quantitative findings, demonstrating the sharp decline in parsing quality on photographed images and the significant improvement after unwarping. More cases are presented in the supplementary.
  • Figure 4: Visualization of Qwen3-VL-4B end-to-end translation results on DocPTBench. (a) shows the document image input. (b) illustrates a failure case where using a simple prompt causes the model to only perform OCR without translating. (c) demonstrates that the CoT prompt successfully rectifies this failure, guiding the model to produce the correct translation.
  • Figure 5: Illustration of the prompting strategies employed in our document translation experiments. (a) Text Prompt: Used for the text-only machine translation baseline. (b) Simple Prompt: Direct instruction for end-to-end visual document translation. (c) CoT Prompt: A Chain-of-Thought approach that explicitly instructs the model to perform OCR recognition prior to translation to mitigate modality gaps.
  • ...and 21 more figures
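
The Figure 1 caption reports parsing quality as an edit distance, where lower is better. The sketch below shows one common way such a score can be computed: a character-level Levenshtein distance normalized by the length of the longer string. This section does not state the benchmark's exact normalization or text preprocessing, so the function names and the choice of normalizer are assumptions made for illustration, not the authors' evaluation code.

```python
def levenshtein(a: str, b: str) -> int:
    """Character-level Levenshtein distance using two rolling rows."""
    # Keep the shorter string on the inner loop to limit memory use.
    if len(a) < len(b):
        a, b = b, a
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # delete ca
                curr[j - 1] + 1,     # insert cb
                prev[j - 1] + cost,  # substitute (or match)
            ))
        prev = curr
    return prev[-1]


def normalized_edit_distance(pred: str, ref: str) -> float:
    """Edit distance scaled to [0, 1]; 0.0 means the strings are identical.

    Assumption: normalize by the longer string's length. The benchmark
    may use a different normalizer.
    """
    longest = max(len(pred), len(ref))
    if longest == 0:
        return 0.0
    return levenshtein(pred, ref) / longest
```

With this definition, comparing a parser's output on the original and on the photographed version of the same page against a single reference gives two scores in the same range, and the gap between them is the kind of degradation the figure visualizes.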