Wukong-Reader: Multi-modal Pre-training for Fine-grained Visual Document Understanding

Haoli Bai, Zhiguang Liu, Xiaojun Meng, Wentao Li, Shuang Liu, Nian Xie, Rongfu Zheng, Liangwei Wang, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu

TL;DR

Wukong-Reader advances visual document understanding (VDU) by treating document textlines as the unit of fine-grained cross-modal alignment. It introduces textline-region contrastive learning (TRC), masked region modeling (MRM), and textline grid matching (TGM) within a hybrid dual- and single-stream architecture that encodes visual and textual information separately before fusing them. The model achieves state-of-the-art or competitive results on information extraction and document classification benchmarks, and it also shows strong textline-level localization. These results indicate that textline structure yields fine-grained multimodal representations that serve diverse VDU tasks and support localization in practice.

Abstract

Unsupervised pre-training on millions of digital-born or scanned documents has shown promising advances in visual document understanding (VDU). While various vision-language pre-training objectives are studied in existing solutions, the document textline, as an intrinsic granularity in VDU, has seldom been explored so far. A document textline usually contains words that are spatially and semantically correlated, and it can be easily obtained from OCR engines. In this paper, we propose Wukong-Reader, trained with new pre-training objectives to leverage the structural knowledge nested in document textlines. We introduce textline-region contrastive learning to achieve fine-grained alignment between the visual regions and texts of document textlines. Furthermore, masked region modeling and textline-grid matching are also designed to enhance the visual and layout representations of textlines. Experiments show that our Wukong-Reader has superior performance on various VDU tasks such as information extraction. The fine-grained alignment over textlines also empowers Wukong-Reader with promising localization ability.
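
To make the idea of textline-region contrastive learning concrete, the following is a minimal sketch of a generic symmetric contrastive (InfoNCE-style) loss between textline text embeddings and the embeddings of their visual regions. It is not the paper's exact formulation, which this summary does not give: the function names, the temperature value of 0.07, and the assumption that row i of each input describes the same textline are all illustrative choices.

```python
import math


def l2_normalize(v):
    """Scale a vector to unit length (zero vectors are left unchanged)."""
    norm = math.sqrt(sum(x * x for x in v)) or 1.0
    return [x / norm for x in v]


def cross_entropy_diag(logits):
    """Mean cross-entropy where the correct class for row i is column i."""
    total = 0.0
    for i, row in enumerate(logits):
        m = max(row)  # subtract the max for numerical stability
        log_z = m + math.log(sum(math.exp(x - m) for x in row))
        total += log_z - row[i]
    return total / len(logits)


def textline_region_contrastive_loss(text_emb, region_emb, temperature=0.07):
    """Symmetric contrastive loss over paired textline and region embeddings.

    Assumption: text_emb[i] and region_emb[i] belong to the same textline,
    and every other pairing in the batch is treated as a negative.
    """
    t = [l2_normalize(v) for v in text_emb]
    r = [l2_normalize(v) for v in region_emb]
    # Cosine similarity between every textline text and every region.
    sim = [[sum(a * b for a, b in zip(ti, rj)) / temperature for rj in r]
           for ti in t]
    sim_t = [list(col) for col in zip(*sim)]  # region-to-text direction
    return 0.5 * (cross_entropy_diag(sim) + cross_entropy_diag(sim_t))


# Toy usage: correctly paired embeddings should score a lower loss
# than the same embeddings with the regions swapped.
texts = [[1.0, 0.0], [0.0, 1.0]]
aligned = textline_region_contrastive_loss(texts, [[1.0, 0.0], [0.0, 1.0]])
swapped = textline_region_contrastive_loss(texts, [[0.0, 1.0], [1.0, 0.0]])
```

Under these assumptions, minimizing the loss pulls each textline's text representation toward its own visual region and away from the other regions in the batch, which is the kind of fine-grained alignment the abstract describes.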

Paper Structure

This paper contains 38 sections, 8 equations, 5 figures, 3 tables.

Figures (5)

  • Figure 1: Document textlines from the letter in FUNSD (Jaume et al., 2019) and the receipt in SROIE (Huang et al., 2019), respectively.
  • Figure 2: Architecture of the proposed Wukong-Reader. The scanned document is sent to the image encoder to extract visual features. Meanwhile, OCR tools extract words and their bounding boxes, which are fed to the text encoder as tokens and 2D positional embeddings, respectively. Wukong-Reader is pre-trained with 1) masked language modeling (MLM); 2) textline-region contrastive learning (TRC) to learn fine-grained textline alignment; 3) masked region modeling (MRM) to enhance the visual representation of textlines; and 4) textline grid matching (TGM), which classifies the words of selected textlines (blue) into different image grids (red). More details are in Section \ref{sec:pretrain_obj}.
  • Figure 3: The training curves in terms of total loss and MLM loss for pre-training with different training objectives.
  • Figure 4: Visualization of learned textline-region alignment. The green and red textline bounding boxes denote the correct and incorrect alignment, respectively.
  • Figure 5: The ANLS scores of each category in DocVQA achieved by Wukong-Reader-large.
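
The Figure 2 caption describes textline grid matching (TGM) as classifying words into image grids. One plausible way to derive such a classification target is sketched below. This is an assumption-laden illustration, not the paper's procedure: the 4x4 grid, the use of the bounding-box center, the row-major cell index, and the function name `grid_label` are all hypothetical.

```python
def grid_label(box, image_width, image_height, rows=4, cols=4):
    """Return the row-major index of the grid cell containing the box center.

    box is (x0, y0, x1, y1) in pixels. The grid size and the center rule
    are illustrative assumptions, not values taken from the paper.
    """
    x0, y0, x1, y1 = box
    cx = (x0 + x1) / 2.0
    cy = (y0 + y1) / 2.0
    # Clamp so that boxes touching or exceeding the image border stay in range.
    col = max(0, min(int(cx * cols / image_width), cols - 1))
    row = max(0, min(int(cy * rows / image_height), rows - 1))
    return row * cols + col


# Toy usage on a 100x100 image with the default 4x4 grid.
top_left = grid_label((0, 0, 10, 10), 100, 100)
bottom_right = grid_label((90, 90, 100, 100), 100, 100)
```

With labels of this kind, each word of a selected textline becomes a small classification problem over grid cells, which gives the model a layout signal tied to where the word sits on the page.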