ERNIE-Layout: Layout Knowledge Enhanced Pre-training for Visually-rich Document Understanding

Qiming Peng, Yinxu Pan, Wenjin Wang, Bin Luo, Zhenyu Zhang, Zhengjie Huang, Teng Hu, Weichong Yin, Yongfeng Chen, Yin Zhang, Shikun Feng, Yu Sun, Hao Tian, Hua Wu, Haifeng Wang

TL;DR

ERNIE-Layout tackles visually-rich document understanding (VrDU) by injecting layout knowledge throughout pre-training: a layout-aware serialization module rearranges the input sequence, and a spatial-aware disentangled attention mechanism is integrated into the multi-modal transformer. It adds two pre-training tasks, reading order prediction and replaced regions prediction, to improve the model's layout awareness and its joint representation of text, layout, and image. The model sets new state-of-the-art results on key information extraction, document question answering, and document image classification datasets, which supports treating layout as a core modality in document understanding.

Abstract

Recent years have witnessed the rise and success of pre-training techniques in visually-rich document understanding. However, most existing methods lack the systematic mining and utilization of layout-centered knowledge, leading to sub-optimal performances. In this paper, we propose ERNIE-Layout, a novel document pre-training solution with layout knowledge enhancement in the whole workflow, to learn better representations that combine the features from text, layout, and image. Specifically, we first rearrange input sequences in the serialization stage, and then present a correlative pre-training task, reading order prediction, to learn the proper reading order of documents. To improve the layout awareness of the model, we integrate a spatial-aware disentangled attention into the multi-modal transformer and a replaced regions prediction task into the pre-training phase. Experimental results show that ERNIE-Layout achieves superior performance on various downstream tasks, setting new state-of-the-art on key information extraction, document image classification, and document question answering datasets. The code and models are publicly available at http://github.com/PaddlePaddle/PaddleNLP/tree/develop/model_zoo/ernie-layout.

Paper Structure

This paper contains 20 sections, 10 equations, 4 figures, and 9 tables.

Figures (4)

  • Figure 1: The architecture and pre-training objectives of ERNIE-Layout. The serialization module corrects the raster-scan order, and the visual encoder extracts the corresponding image features. With the spatial-aware disentangled attention mechanism, ERNIE-Layout is pre-trained with four tasks.
  • Figure 2: The effect of layout knowledge enhanced serialization compared with vanilla raster-scanning order. By using Document-Parser, the perplexity of the document with a complex layout is significantly reduced.
  • Figure 3: The internal working principle of spatial-aware disentangled attention.
  • Figure 4: An example of a document with a complex layout. The serialization result with the raster-scanning order is "... Session Chair: Session Chair: Session Chair: Tuula Hakkarainen ...", while serialization with Document-Parser is "... Session Chair: Tuula Hakkarainen Session Chair: Frank Markert ...", which is more consistent with human reading habits.
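
The contrast described in Figure 4 can be reproduced on a toy page. The sketch below is not Document-Parser: the `Word` type, the coordinates, and the x-gap column heuristic are illustrative assumptions, and only two of the session columns are modelled. It shows why sorting word boxes purely by position interleaves side-by-side blocks, while grouping them first yields an order closer to how a person would read.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Word:
    text: str
    x: int  # left edge of the word box (hypothetical pixel units)
    y: int  # top edge of the word box


def raster_scan(words):
    """Order words top-to-bottom, then left-to-right (plain OCR order)."""
    return [w.text for w in sorted(words, key=lambda w: (w.y, w.x))]


def column_aware(words, gap=100):
    """Toy stand-in for layout-aware serialization (an assumption, not the
    paper's parser): split words into columns wherever the horizontal jump
    between neighbouring boxes exceeds `gap`, then read each column in
    raster order."""
    columns = []
    for w in sorted(words, key=lambda w: w.x):
        if columns and w.x - columns[-1][-1].x <= gap:
            columns[-1].append(w)
        else:
            columns.append([w])
    ordered = []
    for col in columns:
        ordered.extend(raster_scan(col))
    return ordered


# Two side-by-side blocks, each with a label line and a name line.
WORDS = [
    Word("Session", 0, 0), Word("Chair:", 60, 0),
    Word("Tuula", 0, 20), Word("Hakkarainen", 50, 20),
    Word("Session", 300, 0), Word("Chair:", 360, 0),
    Word("Frank", 300, 20), Word("Markert", 350, 20),
]

print(" ".join(raster_scan(WORDS)))
# Session Chair: Session Chair: Tuula Hakkarainen Frank Markert
print(" ".join(column_aware(WORDS)))
# Session Chair: Tuula Hakkarainen Session Chair: Frank Markert
```

A fixed x-gap threshold only works for clean, non-overlapping columns; real documents with tables, nested blocks, or skewed scans need a proper layout parser, which is the role Document-Parser plays in the paper.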