Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding
Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu
TL;DR
Wukong-Reader advances visual document understanding by exploiting document textlines as a fine-grained cross-modal granularity. It introduces textline-region contrastive learning (TRC), masked region modeling (MRM), and textline-grid matching (TGM) within a hybrid dual- and single-stream architecture that encodes visual and textual information separately before fusing them. The model achieves state-of-the-art or competitive results on information extraction and document classification benchmarks, and it also exhibits strong textline-level localization ability. The work demonstrates that leveraging textline structure yields robust, fine-grained multimodal representations for diverse VDU tasks.
Abstract
Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While existing solutions study various vision-language pre-training objectives, the document textline, an intrinsic granularity in VDU, has seldom been explored so far. A document textline, which can be easily obtained from OCR engines, usually contains words that are spatially and semantically correlated. In this paper, we propose Wukong-Reader, trained with new pre-training objectives that leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and the texts of document textlines. Furthermore, masked region modeling and textline-grid matching are designed to enhance the visual and layout representations of textlines. Experiments show that Wukong-Reader achieves superior performance on various VDU tasks such as information extraction. The fine-grained alignment over textlines also gives Wukong-Reader promising localization ability.
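
As a rough illustration of what aligning textline regions with textline texts can look like, the sketch below computes a generic symmetric contrastive (InfoNCE-style) loss over paired embeddings. It is not taken from the paper: the function name `textline_region_contrastive_loss`, the cosine-similarity scoring, and the `temperature` value of 0.07 are all assumptions made for illustration, and the actual objective in Wukong-Reader may differ in its details.

```python
import math


def _normalize(vec):
    # Scale a vector to unit length so dot products become cosine similarities.
    # Assumes the vector is non-zero.
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec]


def _diagonal_cross_entropy(logits):
    # Mean of -log softmax(row_i)[i]: row i should score highest at column i.
    total = 0.0
    for i, row in enumerate(logits):
        row_max = max(row)  # subtract the max for numerical stability
        log_z = row_max + math.log(sum(math.exp(x - row_max) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def textline_region_contrastive_loss(region_embs, text_embs, temperature=0.07):
    # Hypothetical helper: region_embs[i] and text_embs[i] are assumed to be
    # the visual and textual embeddings of the same textline.
    regions = [_normalize(v) for v in region_embs]
    texts = [_normalize(v) for v in text_embs]
    # Pairwise cosine similarities, sharpened by the temperature.
    logits = [
        [sum(a * b for a, b in zip(r, t)) / temperature for t in texts]
        for r in regions
    ]
    logits_transposed = [list(col) for col in zip(*logits)]
    # Average the region-to-text and text-to-region directions.
    region_to_text = _diagonal_cross_entropy(logits)
    text_to_region = _diagonal_cross_entropy(logits_transposed)
    return 0.5 * (region_to_text + text_to_region)
```

Under these assumptions, matched region and text embeddings give a loss close to zero, while mismatched pairs are penalized, which is the behavior a fine-grained alignment objective aims for.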
