
Visually Guided Generative Text-Layout Pre-training for Document Intelligence

Zhiming Mao, Haoli Bai, Lu Hou, Jiansheng Wei, Xin Jiang, Qun Liu, Kam-Fai Wong

TL;DR

This work proposes visually guided generative text-layout pre-training, named ViTLP, and introduces a straightforward yet effective multi-segment generative pre-training scheme that enables ViTLP to process word-intensive documents of any length.

Abstract

Prior studies show that pre-training techniques can boost the performance of visual document understanding (VDU), which typically requires models to perceive and reason over both document texts and layouts (e.g., locations of texts and table cells). To this end, we propose visually guided generative text-layout pre-training, named ViTLP. Given a document image, the model optimizes hierarchical language and layout modeling objectives to generate the interleaved text and layout sequence. In addition, to address the limitation of Transformers in processing long documents, we introduce a straightforward yet effective multi-segment generative pre-training scheme that enables ViTLP to process word-intensive documents of any length. ViTLP can function as a native OCR model to localize and recognize texts in document images. ViTLP can also be effectively applied to various downstream VDU tasks. Extensive experiments show that ViTLP achieves competitive performance against existing baselines on benchmark VDU tasks, including information extraction, document classification, and document question answering.
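
To make the two ideas in the abstract concrete, the following is a minimal Python sketch of (1) flattening words and their bounding boxes into one interleaved text-layout sequence and (2) cutting a long sequence into segments that start with a segment mode token. It is an illustration under stated assumptions, not the authors' implementation: the `<box ...>` token format, the function names `interleave` and `split_segments`, and the `context_words` overlap between consecutive segments are all hypothetical choices made for this example. Only the `[BOS]` and `[CONT]` mode tokens are taken from the paper's Figure 2 caption.

```python
from typing import List, Tuple

# (word text, (x1, y1, x2, y2)) -- the box layout is an assumption for this sketch.
Word = Tuple[str, Tuple[int, int, int, int]]

BOS, CONT = "[BOS]", "[CONT]"  # segment mode tokens named in the Figure 2 caption


def interleave(words: List[Word]) -> List[str]:
    """Flatten (word, box) pairs into one sequence: word, box, word, box, ..."""
    seq: List[str] = []
    for text, (x1, y1, x2, y2) in words:
        seq.append(text)
        # Hypothetical box token; the real model's layout encoding may differ.
        seq.append(f"<box {x1} {y1} {x2} {y2}>")
    return seq


def split_segments(
    words: List[Word], max_words: int, context_words: int = 0
) -> List[List[str]]:
    """Split a long document into segments of at most `max_words` words.

    The first segment starts with [BOS]; every later segment starts with
    [CONT] and repeats the last `context_words` words of the previous
    segment as context (an assumption of this sketch).
    """
    if max_words <= 0 or not 0 <= context_words < max_words:
        raise ValueError("need max_words > 0 and 0 <= context_words < max_words")

    segments: List[List[str]] = []
    pos = 0  # index of the first word not yet covered by any segment
    while pos < len(words):
        if pos == 0:
            mode, chunk = BOS, words[:max_words]
            pos = len(chunk)
        else:
            new_end = pos + (max_words - context_words)
            mode, chunk = CONT, words[pos - context_words:new_end]
            pos = min(new_end, len(words))
        segments.append([mode] + interleave(chunk))
    return segments


# Usage: five words, two words per segment, one word of repeated context.
demo_words = [(f"w{i}", (i, 0, i + 1, 1)) for i in range(5)]
demo_segments = split_segments(demo_words, max_words=2, context_words=1)
```

Because every segment has a bounded length regardless of how many words the document holds, a scheme of this shape lets a fixed-context decoder cover documents of arbitrary length, which is the property the abstract claims for the multi-segment scheme.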

Paper Structure

This paper contains 43 sections, 11 equations, 8 figures, and 6 tables.

Figures (8)

  • Figure 1: Overview of the ViTLP workflow. Given a document image as input, ViTLP can generate sequences of text and layout (i.e., word bounding boxes) for various VDU tasks with task-specific prefixes.
  • Figure 2: Overview of the ViTLP architecture. ViTLP is a generative pre-training model that performs autoregressive text-layout modeling conditioned on visual document inputs. ViTLP adopts hierarchical decoder heads to generate target text-layout sequences in a global-to-local manner. The segment mode tokens $\in\{ \texttt{[BOS]}, \texttt{[CONT]}\}$ prompt the beginning and continuation modes of generation, respectively.
  • Figure 3: Visualization of ViTLP-generated answers on DocVQA. The ViTLP output answer sequences consist of answer words (in blue) and their corresponding location coordinates (in red). For direct visualization, we draw the region of interest (ROI) given by the output layout coordinates on the image.
  • Figure 4: Distribution of document sequence lengths. The text sequences are tokenized with the standard GPT-2 BPE tokenizer.
  • Figure 5: ViTLP OCR results on a webpage. For comprehensive visualization, we render the output texts (in blue) and bounding boxes (in red) according to ViTLP's interleaved output sequence.
  • ...and 3 more figures