DocPTBench: Benchmarking End-to-End Photographed Document Parsing and Translation
Yongkun Du, Pinxuan Chen, Xuye Ying, Zhineng Chen
TL;DR
DocPTBench introduces the first end-to-end benchmark for parsing and translating photographed documents, exposing robustness gaps in both expert parsing systems and general multimodal LLMs when they move from pristine to real-world captured content. By combining over 1,300 photographed documents across multiple domains, eight translation directions, and human-verified parsing and translation annotations, the benchmark enables fair comparisons between specialized parsers and general MLLMs. The study reveals substantial performance drops caused by photographic distortions, demonstrates that unwarping helps but does not fully close the gap, and shows that decoupling perception from translation via Chain-of-Thought prompting improves end-to-end translation yet is not a universal solution. Overall, DocPTBench provides a realistic, open resource for building robust document intelligence systems that function under uncontrolled capture conditions.
Abstract
The advent of Multimodal Large Language Models (MLLMs) has unlocked the potential for end-to-end document parsing and translation. However, prevailing benchmarks such as OmniDocBench and DITrans are dominated by pristine scanned or digital-born documents, and thus fail to adequately represent the intricate challenges of real-world capture conditions, such as geometric distortions and photometric variations. To fill this gap, we introduce DocPTBench, a comprehensive benchmark specifically designed for Photographed Document Parsing and Translation. DocPTBench comprises over 1,300 high-resolution photographed documents from multiple domains, includes eight translation scenarios, and provides meticulously human-verified annotations for both parsing and translation. Our experiments demonstrate that transitioning from digital-born to photographed documents results in a substantial performance decline: popular MLLMs exhibit an average accuracy drop of 18% in end-to-end parsing and 12% in translation, while specialized document parsing models show a significant average decrease of 25%. This performance gap underscores the unique challenges posed by documents captured in real-world conditions and reveals the limited robustness of existing models. Dataset and code are available at https://github.com/Topdu/DocPTBench.
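To make the reported "average drop" concrete, the following is a minimal sketch of how such a figure could be computed from per-model scores on the digital-born and photographed versions of the same documents. The abstract does not state whether the drops are absolute (percentage points) or relative to the digital-born score, so the sketch supports both. The function name `average_drop`, the model names, and all scores are hypothetical placeholders, not values or code from DocPTBench.

```python
from statistics import mean


def average_drop(clean_scores, photo_scores, relative=False):
    """Mean per-model drop from digital-born to photographed scores.

    Both arguments map a model name to a score where higher is better.
    With relative=False the drop is in score points; with relative=True
    it is a fraction of the digital-born score (which must be non-zero).
    """
    if clean_scores.keys() != photo_scores.keys():
        raise ValueError("both score tables must cover the same models")
    drops = []
    for model, clean in clean_scores.items():
        diff = clean - photo_scores[model]
        drops.append(diff / clean if relative else diff)
    return mean(drops)


# Hypothetical placeholder scores, for illustration only.
clean = {"model_a": 80.0, "model_b": 60.0}
photo = {"model_a": 60.0, "model_b": 45.0}

absolute = average_drop(clean, photo)                   # in score points
fractional = average_drop(clean, photo, relative=True)  # as a fraction
```

Averaging per-model drops, rather than subtracting two pooled averages, keeps each model's contribution visible; for absolute drops the two approaches give the same number, but for relative drops they generally differ.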
