Towards Human-Like Machine Comprehension: Few-Shot Relational Learning in Visually-Rich Documents

Hao Wang, Tang Li, Chenhui Chu, Nengjun Zhu, Rui Wang, Pinpin Zhu

TL;DR

This work introduces two new few-shot benchmarks built upon existing supervised benchmark datasets and proposes a variational approach that incorporates relational 2D-spatial priors and prototypical rectification techniques. The approach aims to generate relation representations that are more aware of spatial context and unseen relations, in a manner similar to human perception.

Abstract

Key-value relations are prevalent in Visually-Rich Documents (VRDs), often depicted in distinct spatial regions accompanied by specific color and font styles. These non-textual cues serve as important indicators that greatly enhance human comprehension and acquisition of such relation triplets. However, current document AI approaches often fail to consider this valuable prior information related to visual and spatial features, resulting in suboptimal performance, particularly when dealing with limited examples. To address this limitation, our research focuses on few-shot relational learning, specifically targeting the extraction of key-value relation triplets in VRDs. Given the absence of a suitable dataset for this task, we introduce two new few-shot benchmarks built upon existing supervised benchmark datasets. Furthermore, we propose a variational approach that incorporates relational 2D-spatial priors and prototypical rectification techniques. This approach aims to generate relation representations that are more aware of spatial context and unseen relations, in a manner similar to human perception. Experimental results show that the proposed method outperforms existing methods. This study also opens up new possibilities for practical applications.

Paper Structure (30 sections, 4 equations, 7 figures, 5 tables, 1 algorithm)

Figures (7)

  • Figure 1: Illustration of the distinction between our work and previous work (Popovic2022FewShotDR) during an episode in the few-shot relational learning scenario. In the testing task, we aim to extract triplets that consist of entities and relation types for a given query document. Notably, this task involves a different set of relation types from the training task and is performed on a novel collection of documents. Conventional approaches typically use off-the-shelf OCR engines to extract text from the original document images and rely solely on text features to extract relational triplets. In contrast, our work takes a human-like perspective and leverages multimodal information to extract the relational triplets. While we use simple receipts with well-aligned layouts to illustrate the idea, real-world scenarios are considerably more complex and challenging.
  • Figure 2: Copying, masking and sampling.
  • Figure 3: The double-humped distribution of specific key-value types on a document page, including: "Shipper" and "Weight" in the SEAB dataset and "Menu" and "Total" in the CORD dataset.
  • Figure 4: Our model architecture comprises three key components: ROI regression, prototypical rectification, and proximity-based classification. By predicting explicit ROI windows, the model directs its attention to relevant regions and generates more robust representations that encompass multiple modalities. Prototypical rectification lets it learn high-dimensional relation-agnostic features, which helps adaptation to new relation classes.
  • Figure 5: Semantic similarity heatmap for the entity representations generated by using BERT and LayoutLMv2.
  • ...and 2 more figures
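
The Figure 4 caption names proximity-based classification as one component of the model. The captions do not spell out how it is computed, so the following is only a minimal sketch of one common reading: nearest-prototype classification as used in standard prototypical networks, where each relation class is represented by the mean of its support embeddings and a query is assigned to the closest prototype. The function names, the Euclidean distance, and the toy 2-D embeddings are assumptions made for illustration, not the authors' implementation; the labels "Total" and "Menu" are borrowed from the CORD key-value types mentioned in Figure 3.

```python
from collections import defaultdict
from math import dist  # Euclidean distance, available in Python 3.8+


def build_prototypes(support):
    """Average the support embeddings of each relation label.

    `support` is a list of (label, embedding) pairs; this layout is a
    hypothetical choice for the sketch.
    """
    grouped = defaultdict(list)
    for label, vec in support:
        grouped[label].append(vec)
    # Column-wise mean of the embeddings collected for each label.
    return {
        label: [sum(col) / len(vecs) for col in zip(*vecs)]
        for label, vecs in grouped.items()
    }


def classify(query, prototypes):
    """Return the label whose prototype is closest to the query embedding."""
    return min(prototypes, key=lambda label: dist(query, prototypes[label]))


# Toy 2-D embeddings invented for illustration (a 2-way 2-shot episode).
support = [
    ("Total", [0.9, 0.1]),
    ("Total", [1.1, -0.1]),
    ("Menu", [-1.0, 0.2]),
    ("Menu", [-0.8, 0.0]),
]
prototypes = build_prototypes(support)
predicted = classify([0.7, 0.3], prototypes)
```

In this sketch the prototypes are plain class means. The rectification step described in the caption would adjust them before the distance comparison, and that step is not modelled here.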