RE$^2$: Region-Aware Relation Extraction from Visually Rich Documents

Pritika Ramu, Sijia Wang, Lalla Mouatadid, Joy Rimchala, Lifu Huang

TL;DR

This paper addresses relation extraction in visually rich documents (VRDs) by exploiting region-level spatial structure. It introduces RE$^2$, a framework that combines LayoutXLM-based entity encoding with an edge-aware graph attention network over a region-aware bipartite Q–A graph, together with a constraint loss that enforces one-to-one Q–A mappings. Edge representations are computed from multi-granular region bounding boxes (entity-level, paragraph-level, and tabular-level), and a biaffine classifier makes the binary relation prediction. RE$^2$ achieves state-of-the-art performance on FUNSD, XFUND, and the newly proposed DiverseForm in supervised, multitask, and zero-shot cross-lingual transfer settings. Ablation studies confirm the benefit of joint node-edge embeddings and of region-aware information, and the DiverseForm dataset enables cross-domain evaluation in multilingual and multi-domain contexts.
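
As a rough illustration of the biaffine classification step mentioned above, the following is a minimal sketch of the standard biaffine form, $s = q^\top U a + w^\top [q; a] + b$, followed by a sigmoid for the binary link decision. The parametrization, the function names, and the toy vectors are assumptions for illustration; the summary does not state the exact form used in RE$^2$.

```python
import math

def biaffine_score(q, a, U, w, b):
    """Standard biaffine score: q^T U a + w^T [q; a] + b (assumed form)."""
    bilinear = sum(
        q[i] * U[i][j] * a[j]
        for i in range(len(q))
        for j in range(len(a))
    )
    # The linear term acts on the concatenation of the two entity vectors.
    linear = sum(wi * xi for wi, xi in zip(w, q + a))
    return bilinear + linear + b

def link_probability(q, a, U, w, b):
    """Sigmoid of the biaffine score, read as P(question q links to answer a)."""
    return 1.0 / (1.0 + math.exp(-biaffine_score(q, a, U, w, b)))

# Toy 2-dimensional question and answer representations (hypothetical values).
q = [1.0, 0.0]
a = [0.0, 1.0]
U = [[0.0, 2.0], [0.0, 0.0]]
w = [0.5, 0.0, 0.0, -0.5]
b = 0.0

score = biaffine_score(q, a, U, w, b)
prob = link_probability(q, a, U, w, b)
```

In a real model the vectors `q` and `a` would be the entity representations produced by the encoder and graph layers, and `U`, `w`, `b` would be learned.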

Abstract

Current research in form understanding predominantly relies on large pre-trained language models, necessitating extensive data for pre-training. However, the importance of layout structure (i.e., the spatial relationship between the entity blocks in the visually rich document) to relation extraction has been overlooked. In this paper, we propose REgion-Aware Relation Extraction (RE$^2$) that leverages region-level spatial structure among the entity blocks to improve their relation prediction. We design an edge-aware graph attention network to learn the interaction between entities while considering their spatial relationship defined by their region-level representations. We also introduce a constraint objective to regularize the model towards consistency with the inherent constraints of the relation extraction task. Extensive experiments across various datasets, languages and domains demonstrate the superiority of our proposed approach.
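
One way to picture a constraint objective of this kind is as a penalty on predictions that violate a structural rule of the task. The sketch below assumes the rule is that each answer should link to at most one question, and penalizes any answer whose total predicted link probability exceeds 1. Both the rule and the hinge-style penalty are assumptions made for illustration; the abstract does not give the actual formulation.

```python
def one_to_one_penalty(probs):
    """Hinge-style penalty on answers linked to more than one question.

    probs[i][j] is the predicted link probability between question i
    and answer j. This is a hypothetical formulation, not the paper's loss.
    """
    n_answers = len(probs[0]) if probs else 0
    penalty = 0.0
    for j in range(n_answers):
        # Total probability mass that all questions assign to answer j.
        column_mass = sum(row[j] for row in probs)
        penalty += max(0.0, column_mass - 1.0)
    return penalty

# Each answer is claimed by essentially one question: no penalty.
consistent = [[0.9, 0.1], [0.05, 0.8]]
# Both questions strongly claim answer 0: penalized.
conflicting = [[0.9, 0.1], [0.8, 0.7]]

penalty_ok = one_to_one_penalty(consistent)
penalty_bad = one_to_one_penalty(conflicting)
```

Such a term would typically be added to the main classification loss with a weighting coefficient, so that the model is nudged toward consistent predictions rather than hard-constrained.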

Paper Structure

This paper contains 30 sections, 15 equations, 7 figures, 8 tables.

Figures (7)

  • Figure 1: Example of entity and relation extraction from a visually rich document. The colored boxes represent three categories of semantic entities and the arrows represent relations between them.
  • Figure 2: Overview of the REgion-Aware Relation Extraction (RE$^2$) framework. A bipartite graph of Question and Answer entities is constructed. In the eGAT layer, the representation of each entity is updated based on the attention scores of its first-order neighbors.
  • Figure 3: Entity-level bounding boxes (for question and answer entities) are shown in blue, paragraph-level bounding boxes in red, and tabular-based bounding boxes in green.
  • Figure 4: Domain distribution of DiverseForm.
  • Figure 5: Visualization of paragraph-level regions (a), tabular regions (b) and predictions (c) for a Portuguese form in XFUND.
  • ...and 2 more figures
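
The bipartite Question–Answer graph described in the Figure 2 caption can be sketched as follows. Every Question entity is paired with every Answer entity, and each candidate edge carries a region box. Here the region is simply the smallest box enclosing both entities, which is an assumption made for illustration: the captions do not say how paragraph-level or tabular regions are obtained. The entity dictionary layout and function names are likewise hypothetical.

```python
from itertools import product

def enclosing_box(box_a, box_b):
    """Smallest (x0, y0, x1, y1) box that contains both input boxes."""
    return (
        min(box_a[0], box_b[0]),
        min(box_a[1], box_b[1]),
        max(box_a[2], box_b[2]),
        max(box_a[3], box_b[3]),
    )

def build_bipartite_edges(entities):
    """Pair every question with every answer; other labels are ignored."""
    questions = [e for e in entities if e["label"] == "question"]
    answers = [e for e in entities if e["label"] == "answer"]
    return [
        {"q": q["id"], "a": a["id"], "region": enclosing_box(q["box"], a["box"])}
        for q, a in product(questions, answers)
    ]

# Toy form with one question, two answers, and a header (hypothetical boxes).
entities = [
    {"id": 0, "label": "question", "box": (10, 10, 60, 20)},
    {"id": 1, "label": "answer", "box": (70, 10, 150, 20)},
    {"id": 2, "label": "answer", "box": (70, 30, 150, 40)},
    {"id": 3, "label": "header", "box": (10, 0, 150, 8)},
]

edges = build_bipartite_edges(entities)
```

Each candidate edge would then be scored by the relation classifier, with the region box serving as one possible input to the edge representation.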