ReLayout: Towards Real-World Document Understanding via Layout-enhanced Pre-training

Zhouqiang Jiang, Bowen Wang, Junhao Chen, Yuta Nakashima

TL;DR

This paper introduces a new variant of the VrDU task, real-world visually-rich document understanding (ReVrDU), which does not allow the use of manually annotated semantic groups. It also proposes a new method, ReLayout, compliant with the ReVrDU scenario, which learns to capture semantic grouping by arranging words and bringing the representations of words that potentially belong to the same semantic group closer together.

Abstract

Recent approaches for visually-rich document understanding (VrDU) use manually annotated semantic groups, where a semantic group encompasses all semantically relevant but not obviously grouped words. As OCR tools are unable to automatically identify such grouping, we argue that current VrDU approaches are unrealistic. We thus introduce a new variant of the VrDU task, real-world visually-rich document understanding (ReVrDU), that does not allow the use of manually annotated semantic groups. We also propose a new method, ReLayout, compliant with the ReVrDU scenario, which learns to capture semantic grouping by arranging words and bringing the representations of words that potentially belong to the same semantic group closer together. Our experimental results demonstrate that the performance of existing methods deteriorates on the ReVrDU task, while ReLayout shows superior performance.

Paper Structure

This paper contains 26 sections, 8 equations, 9 figures, 3 tables.

Figures (9)

  • Figure 1: Result of LayoutMask [tu2023layoutmask] for Semantic Entity Classification using different segments. Legend: Segment, QUESTION, ANSWER.
  • Figure 2: Architecture of ReLayout: MLM masks word-level tokens and reconstructs the original tokens. 1-LOP masks global 1D positions at the text segment and reconstructs local 1D positions. 2-TSC uses self-supervised techniques to adaptively cluster the representations of text segments that belong to the same semantic group.
  • Figure 3: Two example document images from CORD.
  • Figure 4: Visualization of pre-trained representations.
  • Figure 5: Visualization of representations learned under different pre-training tasks.
  • ...and 4 more figures
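
The Figure 2 caption distinguishes global 1D positions from local 1D positions within a text segment. The following is a minimal sketch of that distinction only, not the authors' implementation: the function name `position_ids`, the zero-based indexing, and the sample segments are all assumptions made for illustration.

```python
from typing import List, Tuple


def position_ids(segments: List[List[str]]) -> Tuple[List[int], List[int]]:
    """Return (global_ids, local_ids) for words grouped into text segments.

    Assumption: global positions count words across the whole document,
    while local positions restart at 0 inside every segment.
    """
    global_ids: List[int] = []
    local_ids: List[int] = []
    next_global = 0
    for segment in segments:
        for local_index, _word in enumerate(segment):
            global_ids.append(next_global)   # running index over the document
            local_ids.append(local_index)    # index within the current segment
            next_global += 1
    return global_ids, local_ids


# Hypothetical segments, as an OCR tool might return them line by line.
segments = [["Invoice", "No."], ["12345"], ["Total", "amount", "due"]]
global_ids, local_ids = position_ids(segments)
print(global_ids)
print(local_ids)
```

Under these assumptions, a task like 1-LOP would hide the global indices for a segment and ask the model to recover the local ones; how the masking and the prediction head are actually built is not specified in this summary.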