Modeling Layout Reading Order as Ordering Relations for Visually-rich Document Understanding

Chong Zhang, Yi Tu, Yixi Zhao, Chenshu Yuan, Huan Chen, Yue Zhang, Mingxu Chai, Ya Guo, Huijia Zhu, Qi Zhang, Tao Gui

TL;DR

This work reframes layout reading order in visually-rich documents from a single permutation to a set of ordering relations, introducing Immediate Succession During Reading (ISDR) as a directed acyclic relation and Generalized Succession During Reading (GSDR) as its transitive closure. It provides ROOR, a benchmark with relation-level annotations, and develops a relation-extraction-based Reading Order Prediction (ROP) model that outperforms traditional sequence-based approaches. Furthermore, it proposes a reading-order-relation-enhancing (RORE) pipeline that injects an $n \times n$ reading-order matrix into relation-aware attention to improve downstream VrD tasks (VrD-IE/QA), achieving state-of-the-art results on several benchmarks and demonstrating cross-domain gains using pseudo labels. The approach underscores the practical value of explicitly modeling reading order as relations and offers a scalable way to leverage this information across diverse VrD applications.
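
To make the link between the two relations concrete, the following is a minimal sketch of deriving GSDR pairs from ISDR pairs by taking the transitive closure of a directed acyclic graph. The function name `transitive_closure`, the integer element indices, and the toy layout are assumptions made for illustration; they are not taken from the paper or its released code.

```python
from collections import defaultdict


def transitive_closure(n, isdr_pairs):
    """Return every (i, j) such that j is reachable from i via one or more ISDR edges.

    n          -- number of layout elements, indexed 0..n-1 (illustrative convention)
    isdr_pairs -- iterable of (src, dst) immediate-succession pairs
    """
    successors = defaultdict(list)
    for src, dst in isdr_pairs:
        successors[src].append(dst)

    closure = set()
    for start in range(n):
        # Depth-first traversal from each element collects all of its descendants.
        stack = list(successors[start])
        seen = set()
        while stack:
            node = stack.pop()
            if node in seen:
                continue
            seen.add(node)
            closure.add((start, node))
            stack.extend(successors[node])
    return closure


# Hypothetical layout: a title (0) followed by two columns (1 and 2) that may be
# read in either order, both of which lead to a footnote (3).
isdr = [(0, 1), (0, 2), (1, 3), (2, 3)]
gsdr = transitive_closure(4, isdr)
print(sorted(gsdr))
```

In this toy layout the closure adds the indirect pair (0, 3), while no pair relates the two columns to each other in either direction, which is exactly the information a single permutation would be forced to invent.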

Abstract

Modeling and leveraging layout reading order in visually-rich documents (VrDs) is critical in document intelligence, as it captures the rich structural semantics within documents. Previous works typically formulated layout reading order as a permutation of layout elements, i.e., a sequence containing all the layout elements. However, we argue that this formulation does not adequately convey the complete reading order information in the layout, which may lead to performance decline in downstream VrD tasks. To address this issue, we propose to model the layout reading order as ordering relations over the set of layout elements, which are expressive enough to capture the complete reading order information. To enable empirical evaluation of methods for this improved form of reading order prediction (ROP), we establish a comprehensive benchmark dataset that annotates reading order as relations over layout elements, together with a relation-extraction-based method that outperforms previous methods. Moreover, to highlight the practical benefits of the improved form of layout reading order, we propose a reading-order-relation-enhancing pipeline that improves model performance on arbitrary VrD tasks by introducing additional reading order relation inputs. Comprehensive results demonstrate that the pipeline generally benefits downstream VrD tasks: (1) by utilizing the reading order relation information, the enhanced downstream models achieve SOTA results on both task settings of the targeted dataset; (2) by utilizing the pseudo reading order information generated by the proposed ROP model, the enhanced models improve across all three models and eight cross-domain VrD-IE/QA task settings without targeted optimization.

Paper Structure (37 sections, 7 equations, 6 figures, 8 tables)

Figures (6)

  • Figure 1: Motivation of reformulating layout reading order. In complex document layouts, multiple reading sequences are acceptable (displayed in the first three rows); thus the reading order information may be incomplete if represented by one single sequence. We propose to represent the relationship of immediate succession during reading among layout elements using a directed acyclic relation (displayed in the last row as a directed acyclic graph), ensuring that the complete layout reading order information is conveyed.
  • Figure 2: The reading-order-relation-enhancing pipeline (right, green) compared with the original pipeline (left, blue) for general document processing. "RM" denotes Malaysian Ringgit.
  • Figure 3: Several example layouts with non-linear reading order. Annotations are drawn at block level for better visualization. (a) The complex layout includes multiple possible reading sequences (illustrated in Fig. 1); (b) The reading order of the header, footer and watermark within the layout is separate from that of the main body; (c) The table within the layout can be read either vertically or horizontally; (d) Indirect reading order relationships are also important, as relevant elements may be separated by other content.
  • Figure 4: Reading order relation information is represented as an $n \times n$ binary matrix to be leveraged in downstream VrD tasks, where $n$ is the number of input textual tokens.
  • Figure 5: Case study of the proposed reading order prediction model. Each arrow represents a predicted relation between two segments.
  • ...and 1 more figures
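
The $n \times n$ binary matrix described in the Figure 4 caption can be sketched as follows. This is a minimal illustration under a stated assumption: a relation between two segments is expanded to every token pair drawn from those segments. The captions above do not specify how segment-level relations map to tokens, so this expansion rule, the function name `relation_matrix`, and the toy inputs are all hypothetical.

```python
def relation_matrix(token_to_segment, segment_pairs):
    """Build an n x n binary matrix over tokens from segment-level relation pairs.

    token_to_segment -- list where entry t is the segment index of token t
    segment_pairs    -- iterable of (src_segment, dst_segment) relations
    Assumption: entry [i][j] is 1 when token i's segment is related to token j's segment.
    """
    n = len(token_to_segment)
    related = set(segment_pairs)
    return [
        [1 if (token_to_segment[i], token_to_segment[j]) in related else 0
         for j in range(n)]
        for i in range(n)
    ]


# Hypothetical input: four tokens, the first two belonging to segment 0,
# then one token each for segments 1 and 2, read in the order 0 -> 1 -> 2.
token_to_segment = [0, 0, 1, 2]
matrix = relation_matrix(token_to_segment, [(0, 1), (1, 2)])
for row in matrix:
    print(row)
```

A matrix of this shape lines up with a token-level attention map of the same size, which is what makes it usable as an extra input to relation-aware attention; how the paper actually combines the two is not shown here.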